Resistance Meta-Analysis


Abstract 

Resistance  training  increases  muscle  strength.  Muscle  strength  gains  are  influenced  by 
program  design.  This  review  attempted  to  identify  design  choices  that  would  be  best 
practices.  A  best  practice  is  a  design  option  that  produces  significantly  better  results  than 
any  other  option.  To  ensure  sensitive  assessments  of  program  design  effects,  statistical 
procedures  adjusted  for  differences  in  program  length,  and  allowed  for  the  repeated 
measures  structure  of  the  study  designs.  Untrained  individuals  benefitted  much  more 
from  training  than  trained  individuals.  Gender  had  little  effect.  Age  effects  differed  for 
men  and  women.  Given  the  impact  of  participant  characteristics  on  the  training  response, 
the  effects  of  different  program  design  facets  were  examined  separately  for  programs 
with  untrained  and  trained  participants.  Periodization,  number  of  sessions  per  week, 
number  of  sets  per  session,  and  intensity  (number  of  repetitions  per  set)  were  significant 
moderators  for  untrained  participants;  sets  per  session  and  intensity  were  significant 
moderators  for  trained  participants.  However,  comparisons  generally  showed  that  no 
single  design  option  was  significantly  better  than  all  others.  The  available  evidence  may 
rule  out  some  design  choices  (e.g.,  a  single  set  per  session),  but  it  is  too  limited  to 
identify  best  practices. 


Introduction

Resistance  training  literature  reviews  have  shown  definitively  that  training 
increases muscle strength (Falk & Tenenbaum, 1996; Payne, Morrow, Johnson, &
Dalton, 1997; Peterson, Rhea, & Alvar, 2004, 2005; Rhea & Alderman, 2004; Rhea,
Alvar, & Burkett, 2002; Rhea, Alvar, Burkett, & Ball, 2003; Wolfe, LeMura, & Cole,
2004).  The  same  reviews  have  established  that  strength  gains  are  influenced  by  training 
program  designs. 

With  the  effectiveness  of  resistance  training  well  established,  attention  shifts  to  a 
different  question:  Is  it  possible  to  identify  best  practices  for  resistance  training?  A  best 
practice  is  a  specific  program  design  option  that  is  superior  to  all  other  possible  choices 
for  that  program  design  facet.  For  example,  the  number  of  sets  in  each  training  session  is 
a  program  design  facet.  If  the  cumulative  research  record  indicated  that  three  sets  per 
session  produced  significantly  better  results  than  any  other  choice  for  this  facet,  then 
three  sets  per  session  would  be  a  best  practice. 

This  review  attempted  to  identify  best  practices  based  on  the  available  evidence. 
Two  differences  from  past  reviews  were  introduced  to  maximize  sensitivity  to  the 
presence  of  best  practices  if  they  exist.  First,  logic  and  common  experience  lead  to  the 
expectation  that  longer  training  programs  will  produce  greater  training  effects.  In 
previous meta-analyses, program length has been treated as a categorical variable (e.g.,
6-16 weeks vs. 17-40 weeks) rather than a continuous variable (Rhea & Alderman, 2004;
Payne et al., 1997; Wolfe et al., 2004). Important information may have been lost by
collapsing program length into categories. This study revisited the question “Do longer
programs produce greater training effects?” with program length treated as a continuous
variable. An affirmative answer to this question would raise the secondary question “Has
the  failure  to  control  for  differences  in  program  length  distorted  the  relationships  of 
program  design  facets  to  program  effectiveness?”  This  review  introduced  statistical 
controls  for  differences  in  program  length  to  answer  this  question. 

The  second  difference  between  this  review  and  prior  reviews  involved  the 
treatment  of  statistical  issues.  One  set  of  issues  derived  from  the  repeated  measures 
structure  of  the  evidence.  Resistance  training  studies  often  employ  multiple  strength  tests 
to  assess  training  effects.  Each  test  is  an  attempt  to  measure  the  training  program  effects. 
If  the  training  program  produces  a  single  common  effect  for  all  muscle  groups,  the  use  of 
multiple  tests  constitutes  repeated  measurement  of  that  effect.  Steps  must  be  taken  to  deal 
with the fact that the results for different tests are not independent (Gleser & Olkin, 1994).

A  set  of  related  statistical  issues  arose  from  the  typical  design  of  resistance 
training  studies.  Resistance  training  studies  routinely  employ  repeated  measures  research 
designs.  Strength  tests  are  administered  before  the  training  program  begins  and  again 
after  the  program  has  been  completed.  The  difference  between  the  pre-  and  post-training 
scores  is  the  basis  for  estimating  the  effect  size  (ES)  for  the  training  program.  Steps  must 
be  taken  to  allow  for  this  research  design  when  estimating  the  ES  for  a  study  (Morris  & 
DeShon,  2002). 

A  third  statistical  issue  derived  directly  from  the  current  interest  in  identifying 
best practices. It is not enough to demonstrate that program design choices affect the size
of the training response. Analysis of variance (ANOVA) tests have been used to test the
hypothesis that all program design choices have produced equal effects. Rejecting this
null hypothesis indicates only that some options differ from other options. It is not
enough to know that differences between options exist. The existence of differences does
not guarantee the existence of a best choice. For example, Rhea et al. (2003) showed that
the number of sets per session affected the magnitude of the training effect for trained
individuals. Four sets per session was identified as the optimal choice because programs
embodying that option had produced the largest average effect, ES = 1.17. However, the
ES for 4 sets per session was only trivially larger than the average ES for 5 sets per
session, ES = 1.15. The hypothesis that 4 sets per session produced a stronger effect than
5 sets per session would be rejected unless the sample size was very large. This example
illustrates the point that a significant ANOVA must be followed by analyses that
evaluate differences between specific program design options. This review employed
post hoc comparisons to determine whether the design option that produced the largest
ES was truly a best practice.

This  review  attempted  to  identify  best  practices  for  several  design  facets  of 
resistance  training  programs.  Statistical  methods  were  introduced  to  deal  with  the 
program  length,  the  repeated  measures  structure  of  the  data,  and  the  need  for  post  hoc 
comparisons  to  determine  whether  a  significant  moderator  effect  truly  identifies  a  best 
practice.  No  other  review  to  date  has  dealt  with  all  of  these  issues  or  employed  a  formal 
definition  of  a  best  practice.  As  a  consequence,  this  review  provided  a  different 
perspective on the available evidence. This review attempted to formally identify best practices for
the  program  design  facets  of  periodization,  number  of  training  sessions  per  week,  number 
of  sets  per  session,  and  number  of  repetitions  per  set.  The  number  of  repetitions  per  set  is 
a  proxy  measure  for  the  intensity  of  the  training  program.  The  search  for  best  practices 
also  considered  age,  gender,  and  training  status  as  demographic  variables  that  might 
influence  the  impact  of  different  program  design  choices.  The  concern  was  that  best 
practices  might  depend  on  the  type  of  person  being  trained. 


Methods 


Literature  Search  Procedures 

The  literature  search  began  by  identifying  articles  that  contributed  data  to 
previous  resistance  training  meta-analyses  (Falk  &  Tenenbaum,  1996;  Payne  et  al.,  1997; 
Peterson et al., 2005; Rhea & Alderman, 2004; Rhea et al., 2002; Rhea et al., 2003;
Wolfe et al., 2004). Subsequent steps centered on a search of the PubMed database. The
search  terms  “resistance  training  or  weight  training  and  strength”  produced  a  list  of  2,432 
candidate  articles. 

The  candidate  articles  were  separated  into  two  groups  for  further  review.  The  first 
group  consisted  of  1,366  articles  published  between  January  1,  2000  and  May  16,  2007, 
the  time  of  the  search.  The  PubMed  abstract  for  each  of  these  articles  was  examined  to 
determine whether it met the inclusion criteria for this review. Articles were dropped at
this point in the search only if the information in the abstract clearly indicated that the
study failed to meet at least one of the criteria.

The second group of articles consisted of 1,066 articles published before 2000.
The titles of these articles were reviewed. The abstract of an article was reviewed only if
the title suggested that the study included an experimental evaluation of one or more
resistance training programs, and the article had not been included in any of the
meta-analyses cited in the introduction to this paper. The review procedures for this second
group of articles were less intensive than those employed for the first group of articles. The
procedures  were  relaxed  because  it  was  assumed  that  prior  meta-analyses  had  identified 
most  of  the  relevant  studies  conducted  prior  to  2000. 

When  the  abstract  of  a  study  was  reviewed,  the  inclusion  criteria  were: 

1. At least one group in the study had to participate in a resistance training
program.  In  addition  to  the  usual  resistance  training  program  studies,  this 
criterion  resulted  in  the  inclusion  of  placebo  control  groups  from  studies  that 
evaluated  the  effects  of  supplements.  In  such  cases  the  placebo  group 
underwent  training  without  any  additional  experimental  manipulations.  Thus, 
the  effects  of  resistance  training  were  not  confounded  with  supplement  effects 
and  could  legitimately  be  included  in  an  overall  evaluation  of  resistance 
training  programs. 

2.  Strength  measurements  had  been  made  prior  to,  and  after,  the  program.  The 
specific  measurements  were  not  a  concern  at  this  point  in  the  search. 

3.  Study  participants  were  healthy.  This  requirement  excluded  studies  of  specific 
disease  populations,  including  chronic  obstructive  pulmonary  disease,  HIV 
infection,  chronic  heart  failure,  diabetes,  fibromyalgia,  and  so  forth.  The 
general  rule  was  that  a  study  was  excluded  if  the  authors  characterized  the 
study  population  as  “patients.”  Studies  of  people  who  were  characterized  as 
obese,  hypertensive,  or  frail  were  excluded.  However,  studies  of  “overweight” 
individuals  were  accepted,  so  weight  considerations  eliminated  only  studies  of 
individuals  toward  the  upper  end  of  the  excess  weight  range.  The  objective  in 
making  these  exclusions  was  to  eliminate  studies  that  might  produce  atypical 
effects because of limitations on the ability to perform training exercises,
and/or  that  involved  disease  and  metabolic  processes  that  might  modify  the 
training  response. 

4.  The  average  study  participant  had  to  be  at  least  16  years  of  age.  This  criterion 
attempted  to  minimize  the  confounding  of  training  effects  with  the  effects  of 
normal developmental processes.

5.  The  study  employed  isotonic  or  isoinertial  strength  measures.  Studies  that 
relied  on  isokinetic  or  isometric  measures  (k  =  174)  were  dropped  to  ensure 
that  the  operational  definition  of  strength  was  the  same  in  each  study.  This 
step  eliminated  measurement  methods  as  a  potential  source  of  variation  in  ES. 
This  criterion  was  introduced  because  Payne  et  al.  (1997)  have  demonstrated 
that  measurement  modality  affected  ES. 

The  full  text  of  438  articles  that  passed  the  screening  procedures  was  examined  to 
determine whether the studies reported the basic data required for this review. The
minimal requirement for retention during this phase of the search was that the study had
to provide the statistics required to compute the ES for isoinertial/isotonic strength tests.
Specifically: 

1. The unit of measurement was pounds or kilograms. Studies that relied on
other  units  of  measurement,  such  as  Newtons,  pneumatic  measures,  or  percent 
change,  were  excluded. 

2. The study reported pre- and post-training measures of strength and the
standard  deviations  for  those  measures.  This  information  was  the  minimum 
required  to  compute  ES  when  combined  with  assumptions  about  the 
magnitude  of  the  pretest-posttest  correlation  (see  Appendix  A).  Strength  was 
measured  more  than  twice  in  some  studies.  When  this  was  the  case,  ES  was 
computed  using  the  initial  and  final  measurements.  Computing  the  effect  for 
each  phase  of  the  training  programs  would  have  increased  the  complexity  of 
the  repeated  measures  problem.  Therefore,  ES  always  represented  the  final 
cumulative  impact  of  training. 
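
The role of the assumed pretest-posttest correlation can be illustrated with a minimal sketch. It assumes the change-score formulation described by Morris and DeShon (2002), in which the mean change is divided by the standard deviation of the change scores; the exact computation used in this review is given in Appendix A and may differ. The function name and the example values are hypothetical.

```python
import math

def effect_sizes(mean_pre, sd_pre, mean_post, sd_post, r):
    """Return (raw-score ES, change-score ES) for one pretest-posttest sample.

    r is the ASSUMED pretest-posttest correlation; most studies do not
    report it, so it must be supplied by the analyst.
    """
    change = mean_post - mean_pre
    # ES that ignores the repeated measures structure: change / pretest SD.
    es_raw = change / sd_pre
    # Standard deviation of the change scores implied by the two SDs and r.
    sd_change = math.sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)
    # Change-score ES (one repeated measures formulation, assumed here).
    es_change = change / sd_change
    return es_raw, es_change

# Hypothetical 1-RM data in kilograms: the same 20 kg gain yields a larger
# change-score ES when the assumed correlation is higher.
print(effect_sizes(100, 20, 120, 20, r=0.5))
print(effect_sizes(100, 20, 120, 20, r=0.875))
```

The sketch shows why the correlation assumption matters: the raw-score ES is unaffected by r, whereas the change-score ES grows as r approaches 1.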

Supplementary  searches  were  conducted  because  the  initial  search  and  data 
coding  took  long  enough  to  allow  further  studies  to  enter  the  literature.  Also,  informal 
reading  of  the  journals  that  provided  most  of  the  studies  identified  in  the  primary  search 
suggested  that  some  resistance  training  studies  might  have  been  missed  because  the 
studies were outside the scope of the initial search terms.

The  first  supplementary  search  addressed  the  problem  by  adding  the  terms 
“Training”  and  “1-RM.”  The  search  term  1-RM  was  used  to  identify  studies  that  involved 
1-repetition  maximum  (1-RM)  strength  measures.  This  search  identified  242  articles,  139 
of  which  were  reviewed  in  detail. 

The  second  supplementary  search  repeated  the  initial  search,  but  covered  a 
different  time  frame.  The  same  keywords  were  used,  but  the  search  was  limited  to  articles 
published  between  January,  2007  and  February,  2009.  This  search  identified  706  articles. 
Examination of the article abstracts reduced the number of articles for direct
inspection to 117. The inclusion rules for the initial review were employed for these later
reviews. 

The final database consisted of information from 196 studies that met the review
criteria.  Control  groups  from  those  studies  were  excluded  from  the  review.  With  this 
restriction,  302  samples  provided  sufficient  data  to  be  included  in  this  review.  The 
cumulative  sample  size  was  4,574  study  participants. 
Table 1

Sample Characteristics

Variable            k     ΣN     Mean     SD  Minimum  Maximum
Age               273   4054    33.20  19.22     15.9     82.0
Height            234   3539   174.19   7.57    154.0    189.0
Weight            270   3941    77.38  11.64     55.0    125.0
Percent body fat   92   1225    21.35   7.37     11.6     39.0
Fat-free mass      58    753    60.18  12.48     27.8     87.0

Note.  Statistics  describe  samples  in  the  analysis,  not  individuals.  The  data  were  not  weighted  for  the 
computations  that  generated  these  descriptive  statistics.  The  statistics  describe  the  population  of  study 
samples  rather  than  a  population  of  individuals. 


Demographic  Characteristics 

Height,  weight,  percent  body  fat,  fat  free  mass.  Height,  weight,  percent  body 
fat,  and  fat  free  mass  were  coded  from  descriptive  statistics  reported  in  the  studies  (see 
Table  1).  Note  that  these  statistics  describe  samples  not  individuals. 

Gender.  The  samples  in  most  studies  consisted  entirely  of  men  or  entirely  of 
women.  Other  studies  combined  the  data  for  men  and  women  when  reporting  the  study 
findings.  A  few  studies  provided  no  definite  information  regarding  gender.  Given  this 
variability,  the  gender  of  each  sample  was  coded  as  male,  female,  men  and  women 
combined, or indeterminate.

Age.  Age  most  often  was  reported  by  giving  the  average  ages  separately  for  each 
treatment  group  in  the  study.  In  other  cases,  the  study  reported  only  the  average  age  for 
all  of  the  study  participants.  An  age  range  (e.g.,  21  to  32  years)  was  another  common 
reporting method. Finally, some studies did not report age directly, but provided
age-related demographic information (e.g., college students, healthy community-dwelling
elders). Given this variable content, age was reduced to a dichotomy identifying older
and younger samples. Where quantitative information was available, samples with an
average  age  <50  were  classified  as  younger.  Qualitative  data  were  coded  based  on 
judgments  of  the  age  range  that  would  be  typical  of  the  group  described.  This  dichotomy 
provided  a  variable  that  minimized  missing  data  for  age  because  it  could  be  applied  to  as 
many  samples  as  possible  given  the  available  information. 

Training  status.  Training  status  was  inferred  from  descriptions  of  the  recent 
training  status  of  study  participants.  The  initial  coding  of  training  status  employed  five 
categories:  sedentary,  recreationally  active,  athletes  in  sports  that  do  not  routinely  involve 
resistance  training,  recreational  weight  trainers,  and  competitive  weight  trainers.  The  last 
category  included  athletes  who  participated  in  weight  lifting  competitions  and  athletes  in 
sports  such  as  football,  where  weight  training  is  employed  to  gain  a  competitive 
advantage. 

Preliminary  analyses  showed  that  a  dichotomy  captured  most  of  the  ES  variation 
across  the  initial  five  categories.  Trained  samples  consisted  of  individuals  in  ongoing 
resistance  training  programs.  Untrained  samples  consisted  of  individuals  who  either  had 
no  prior  resistance  training  experience  or  who  had  not  trained  for  at  least  several  months 
prior to starting the study. The two training status categories are referred to as “trained”
and  “untrained”  participants  in  the  remainder  of  this  paper. 

Program  Design  Facets 

Length.  Program  length  was  the  number  of  weeks  that  the  training  program 
lasted.  In  some  cases,  baseline  strength  measurements  were  taken  well  before  beginning 
the  actual  training  process.  In  other  cases,  several  sessions  with  light  weights  were  used 
to  familiarize  the  study  participants  with  the  resistance  exercises.  In  either  case,  length 
included  only  the  actual  training  period  as  designated  by  the  study’s  authors. 

Sessions  per  week.  This  variable  represented  the  number  of  times  that  each 
exercise  in  the  program  was  performed  during  a  given  week.  This  number  could  be  less 
than  the  total  number  of  training  periods  per  week  because  some  studies  employed  split 
programs.  When  split  sessions  were  employed,  different  exercises  were  performed  at 
different  sessions.  The  training  specificity  principle  implies  that  the  training  effect  for  a 
particular  resistance  exercise  will  derive  almost  entirely  from  the  work  done  in  those 
sessions  in  which  that  exercise  is  actually  performed. 

Sets  per  session.  The  number  of  sets  per  training  session  was  coded  for  non- 
periodized  programs.  The  number  of  sets  per  session  varied  from  1  to  6.  When  the  sets 
per  session  varied  over  the  course  of  a  simple  progressive  program,  the  average  was 
computed  and  rounded  to  the  nearest  whole  number.  The  analyses  of  sets  per  session  also 
included  two  specific  contrasts  that  have  been  of  interest  in  prior  reviews:  1  set  versus  3 
sets (Galvao & Taaffe, 2004) and 1 set versus >1 set (Wolfe et al., 2004; Rhea et al.,
2002).  The  number  of  sets  was  not  coded  for  periodized  programs.  The  systematic 
variation  in  the  number  of  sets  in  periodized  programs  made  it  questionable  whether  any 
single  value  would  be  representative  of  the  program. 

Intensity.  Intensity  is  the  percentage  of  1-RM  lifted  during  each  repetition  in  a 
training  set.  In  practice,  intensity  is  defined  by  the  number  of  repetitions  in  a  set.  The 
assumption is that a person can complete one additional repetition for every 2-3% that the load falls below 1-RM. Thus,
defining  the  target  intensity  as  75%  of  1-RM  corresponds  to  the  expectation  that  program 
participants  will  reach  voluntary  fatigue  after  completing  10  repetitions  within  a  set  of 
each  exercise.  If  the  target  intensity  was  80%  of  1-RM,  program  participants  would  be 
expected  to  reach  voluntary  exhaustion  after  8  repetitions.  The  translation  of  exercise 
repetitions  into  percentages  is  only  approximate  and  some  allowance  must  be  made  for 
fatigue  that  develops  during  a  series  of  exercises.  With  this  in  mind,  training  intensity  is 
usually defined as a range of repetitions, e.g., 8-10 repetitions per set.

Most  studies  covered  in  this  review  defined  intensity  in  terms  of  the  target  range 
for  the  number  of  repetitions  to  voluntary  exhaustion  (e.g.,  8-10  repetitions).  Intensity 
was  coded  by  taking  the  target  range  midpoint  for  repetitions.  If  the  study  employed  a 
familiarization period, a weighted average of the midpoints of the target ranges for the
familiarization  and  training  periods  was  computed.  The  weight  was  the  number  of  weeks 
for  each  target  range.  The  weighted  averages  were  rounded  to  provide  the  final  estimate 
of  the  number  of  repetitions  per  set.  The  rounded  number  of  repetitions  was  converted  to 
estimated  percentages  of  1-RM,  subtracting  2.5%  for  each  repetition.  For  example,  if  a 
program  averaged  8  repetitions  per  set  over  the  entire  training  period,  the  coded  intensity 
was (1 - 8 × 0.025) × 100 = 80%. The estimated average intensity was coded into 5%
ranges (e.g., 72.5% and 75% fell in the same range) with the upper bound used to label the range.
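
The coding steps described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name, the input format, and the round-half-up rule are assumptions, and the open-ended "<60%" category is not handled.

```python
import math

def code_intensity(phases):
    """Code training intensity from target repetition ranges.

    phases: list of (weeks, low_reps, high_reps) tuples, one per target
    range (e.g., a familiarization period followed by the training period).
    Returns (rounded_reps, percent_1rm, range_label).
    """
    total_weeks = sum(weeks for weeks, _, _ in phases)
    # Week-weighted average of the target-range midpoints.
    mean_reps = sum(w * (lo + hi) / 2 for w, lo, hi in phases) / total_weeks
    reps = int(mean_reps + 0.5)            # round half up (assumed rule)
    percent = 100 - 2.5 * reps             # subtract 2.5% of 1-RM per repetition
    label = 5 * math.ceil(percent / 5)     # upper bound of the 5% range
    return reps, percent, label

# 8 repetitions per set over the whole program, as in the example in the text.
print(code_intensity([(12, 8, 8)]))
# Hypothetical program: 2 weeks at 12-15 repetitions, then 10 weeks at 8-10.
print(code_intensity([(2, 12, 15), (10, 8, 10)]))
```

Under these assumptions, a program averaging 11 repetitions per set would be estimated at 72.5% of 1-RM and labeled with the 75% range.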

Periodization.  Periodization  is  a  training  strategy  that  varies  training  objectives, 
training  intensity,  and  training  volume  in  planned  cycles  (Wathen,  Baechle  &  Earle, 
2000).  The  typical  periodized  program  has  three  phases.  The  hypertrophy/endurance 
phase  employs  exercises  with  very  low  to  moderate  intensity  (50%  to  75%  of  1-RM)  with 
high to moderate volume (3-6 sets of 10-20 repetitions). The basic strength phase couples
high intensity exercise (80%-90% of 1-RM) with moderate volume (3-5 sets of 4-8
repetitions). The strength/power phase combines high intensity (75%-90% of 1-RM)
with low volume (3-5 sets of 2-5 repetitions). Progression through these three phases
constitutes  a  training  cycle.  The  overall  structure  of  a  periodized  program  includes 
microcycles  of  1  to  4  weeks,  mesocycles  of  several  weeks  to  several  months,  and 
macrocycles  that  typically  last  a  year,  but  may  be  as  long  as  4  years.  Most  of  the  studies 
in  this  review  lasted  only  a  few  months,  so  the  typical  program  involved  one  cycle 
through  all  three  periodization  phases.  A  few  studies  included  multiple  microcycles  by 
having  very  short  phases  (e.g.,  1  day  or  1  week). 

Table  2  describes  the  distribution  of  the  demographic  variables  and  program 
characteristics  in  the  overall  data  set. 

Analysis  Procedures 

Every  study  included  in  this  review  used  a  pretest-posttest  design.  For  this  reason, 
methods  described  by  Morris  and  DeShon  (2002)  were  applied  to  compute  appropriate 
ESs  for  repeated  measures  (simple  ESrm;  see  Appendix  A).  Subsequent  steps  were  taken 
to  solve  two  problems.  First,  in  many  studies,  more  than  one  strength  test  was 
administered  to  each  sample.  Consequently,  there  was  more  than  one  ES  for  each  sample. 
The associated loss of independence among the ESs posed statistical problems (Gleser & Olkin,
1994).  Second,  exploratory  analyses  showed  that  ESrm  depended  on  which  strength  test 
was  administered.  Different  programs  could  appear  to  produce  stronger  or  weaker  than 
average  effects  by  choosing  particular  strength  tests  as  outcome  measures.  These 
problems  were  addressed  by  identifying  the  10  most  frequently  used  strength  tests. 
Analysis  showed  that  this  restriction  largely  eliminated  the  differences  between  tests,  so 
treating  each  test  as  providing  a  separate  estimate  of  a  common  training  effect  for  each 
sample  was  reasonable.  The  ESRM  values  for  those  tests  were  averaged  to  obtain  a  single 
average  ESrm  for  each  sample  (see  Appendix  B). 1 


1 This review employed several different ESs. An effect that is labeled “ES” refers to an ES computed
without considering the repeated measures data structure, i.e., $ES = (M_{Post} - M_{Pre}) / SD_{Pre}$. An effect that is
labeled “ESRM” is the difference between post-training and pre-training scores with an adjustment for
repeated measures. An effect that is labeled “average ESRM” refers to a measure that is the average of the
ESRM values for 1 to 8 frequently used tests (see Appendix B). An effect that is labeled “adjusted
average ESRM” refers to the average ESRM adjusted for program length using an equation describing the
association of program length with effect size based on all ESRM values (see Equation 1). Finally, an effect
that is labeled “population-adjusted average ESRM” is the average ESRM adjusted for program length using
separate equations for trained and untrained samples (see Equations 2 and 3).
Table 2

General Structure of the Data Set

Moderator          No. Studies  No. Samples  No. ESs   ΣN^a

Gender
  Men                      117          185      371   2247
  Women                     37           53      162    706
  Men and women             39           59      179   1540

Age group
  Younger                  151          240      504   3704
  Older                     45           62      224    870

Training status
  Untrained                131          189      514   2935
  Trained                   56          100      179   1278

Sessions/week
  1                          4            6       19     70
  2                         66          107      215   1980
  3                         98          171      453   2304
  4                          7            9       20    110
  5                          1            1        4     12

Periodization
  No                       108          186      514   2635
  Yes                       68          108      199   1878

Sets per session
  1                         17           31       87    543
  2                          6           11       32    203
  3                         65          114      336   1533
  4                         11           12       21    145
  5                         10           11       22    145
  6                          2            4       10     39

Intensity
  <60%                       2            4       15     57
  60%                        3            9       16    124
  65%                        2            3        9     33
  70%                        3            4       17     97
  75%                       45           69      186    986
  80%                       25           41      127    593
  85%                       26           38       63    500

Total                      196          302      728   4574

Note. The total sample size for the moderator variables can be less than 4574 because some moderator variables could not be coded for some samples.
^a ΣN is the cumulative sample size for each group.
The number of ESs that contributed to the average ESrm varied from sample to
sample. The number of common strength tests administered to different samples ranged
from 0 to 8. The average ESrm most often was based on 1 (k = 121) or 2 (k = 115) tests.
The averages were based on 3 to 6 tests for 57 samples. Samples that had not performed
any of the common tests (k = 4) were dropped from the analysis.

Meta-regression  models  evaluated  potential  moderator  variables.  A  moderator 
was  a  demographic  variable  or  a  program  element  that  might  account  for  variation  in 
ESrm. The meta-regression analyses applied Hedges and Olkin’s (1985) general methods.
These methods included weighted analysis of variance (ANOVA) and weighted linear
regression. The weight variable was the inverse of the estimated variance for ESrm.
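
Inverse-variance weighting can be illustrated with a minimal sketch of a fixed-effects weighted mean. The function name and the numbers are hypothetical; the review's analyses were run in SPSS, so this is an illustration of the weighting principle rather than the authors' implementation.

```python
import math

def weighted_mean_es(effects, variances):
    """Inverse-variance weighted mean effect size and its standard error."""
    # Each sample is weighted by the inverse of its estimated ES variance,
    # so more precisely estimated effects count for more.
    weights = [1.0 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return mean, se

# Two hypothetical samples: the more precise one (variance 0.1) dominates.
mean, se = weighted_mean_es([1.0, 2.0], [0.1, 0.4])
print(round(mean, 3), round(se, 3))
```

With these inputs the weighted mean is closer to 1.0 than to 2.0, because the first sample carries four times the weight of the second.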

Moderators  were  evaluated  in  two  steps.  The  first  step  was  an  overall  test  for  a 
moderator  effect.  This  test  determined  whether  the  adjusted  ESrm  differed  significantly 
across  the  moderator  groups.  The  second  step  was  taken  only  if  there  was  a  statistically 
significant  moderator  effect.  Post  hoc  comparisons  were  conducted  to  determine  which 
groups  differed  significantly.  The  moderator  groups  were  rank  ordered  from  largest  to 
smallest  average  ESRM.  The  group  with  the  largest  average  ESRM  was  adopted  as  the 
reference  group.  The  first  post  hoc  test  compared  the  reference  group  with  the  group  with 
the  second  largest  average  ESRM.  If  these  two  groups  differed  significantly,  the  post  hoc 
comparisons  stopped  at  this  point.  If  the  two  groups  did  not  differ  significantly,  the  group 
with  the  third-largest  average  was  compared  with  the  reference  group.  The  comparisons 
continued  down  the  ranked-ordered  moderator  groups  until  a  significant  difference  was 
found.  The  comparisons  stopped  at  that  point,  and  all  remaining  groups  were  classified  as 
differing  significantly  from  the  reference  group. 

Some post hoc comparison procedures required multiple significance tests.
Performing multiple significance tests increased the probability that at least one
comparison would be statistically significant by chance alone. A Bonferroni significance
criterion was adopted to fix the analysis-wide probability of error at 5% or less. The post
hoc procedures involved j - 1 comparisons for a moderator with j levels. The Bonferroni
criterion for each moderator was $p_{critical} = .05 / (j - 1)$.

The  post  hoc  comparisons  identified  equivalence  sets.  These  sets  consisted  of  the 
design  option  with  the  largest  average  effect  plus  the  alternative  options  that  were  not 
significantly  different  from  this  reference  value.  The  sets  were  equivalent  in  the  sense 
that  the  alternative  set  options  could  not  be  confidently  classified  as  less  effective  than 
the  optimum  design  option  based  on  the  available  evidence. 
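
The step-down post hoc procedure and the resulting equivalence set can be sketched as follows. This is a minimal illustration under stated assumptions: each pairwise test is taken to be a 1-df chi-square on the difference between two inverse-variance weighted means, and the function names and input values are hypothetical rather than taken from the review.

```python
from math import erfc, sqrt

def chi2_p_1df(x):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))

def equivalence_set(levels, alpha=0.05):
    """Build the equivalence set for one moderator.

    levels maps an option name to (weighted mean effect, variance of that mean).
    Options are ranked from largest to smallest mean, and each is compared with
    the top-ranked reference option until the first significant difference.
    """
    ranked = sorted(levels, key=lambda name: levels[name][0], reverse=True)
    if len(ranked) < 2:
        return ranked
    reference = ranked[0]
    ref_mean, ref_var = levels[reference]
    p_critical = alpha / (len(ranked) - 1)  # Bonferroni: .05 / (j - 1)
    members = [reference]
    for name in ranked[1:]:
        mean, var = levels[name]
        chi2 = (ref_mean - mean) ** 2 / (ref_var + var)
        if chi2_p_1df(chi2) < p_critical:
            # This option and every lower-ranked option are classified as
            # differing significantly from the reference.
            break
        members.append(name)
    return members

# Hypothetical means and variances, for illustration only.
options = {"A": (2.5, 0.004), "B": (2.4, 0.04), "C": (2.0, 0.01)}
result = equivalence_set(options)
```

With three options the criterion is .05 / 2 = .025. In this example option B cannot be separated from the reference option A, whereas option C can, so the equivalence set is A and B.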

Large  samples  can  produce  significant  results  even  for  trivial  differences 
(Rosenthal  &  Rosnow,  1984).  To  avoid  mistaking  sample  size  for  explanatory  power,  the 
Tucker-Lewis  index  (TLI;  Tucker  &  Lewis,  1973)  was  adapted  to  provide  an  ES  index 
for the moderator analyses. This index is the proportion of the greater-than-chance
variation in ESRM accounted for by a moderator or set of moderators (Appendix C).
Cohen’s  (1988)  ES  criteria  were  applied  to  characterize  the  TLI  as  indicating  trivial, 
small,  moderate,  or  large  moderator  effects. 

Funnel plots were constructed to evaluate the potential effects of publication bias
(Light & Pillemer, 1984). Egger, Smith, Schneider, and Minder’s (1997) regression
method was applied to obtain a formal statistical assessment of the hypothesis that
publication bias was present. The file drawer problem was not examined because both the
typical ES and the total number of studies were large. Under those circumstances,
Rosenthal’s (1979) file drawer criterion certainly would be satisfied.
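
The regression method can be sketched in its basic unweighted form: each effect size divided by its standard error is regressed on precision (the reciprocal of the standard error), and an intercept that departs from zero indicates funnel plot asymmetry. The function name and the data below are hypothetical, and the exact variant run in SPSS for this review is not known from the text.

```python
from math import fsum, sqrt

def egger_intercept(effects, std_errors):
    """Ordinary least squares fit of (effect / SE) on (1 / SE).

    Returns the intercept, its standard error, and the t ratio (n - 2 df).
    """
    y = [e / s for e, s in zip(effects, std_errors)]  # standardized effects
    x = [1.0 / s for s in std_errors]                 # precision
    n = len(x)
    mean_x, mean_y = fsum(x) / n, fsum(y) / n
    sxx = fsum((xi - mean_x) ** 2 for xi in x)
    sxy = fsum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    resid_var = fsum(r * r for r in residuals) / (n - 2)
    se_intercept = sqrt(resid_var * (1.0 / n + mean_x ** 2 / sxx))
    return intercept, se_intercept, intercept / se_intercept

# Hypothetical data in which smaller studies (larger SE) report larger effects.
effects = [2.5, 2.0, 1.6, 1.3, 1.2]
std_errors = [0.8, 0.6, 0.4, 0.25, 0.2]
intercept, se_intercept, t_ratio = egger_intercept(effects, std_errors)
```

A positive intercept, as in this example, corresponds to a funnel plot in which small studies are shifted toward larger effects.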

All  analyses  were  carried  out  with  the  computer  program  SPSS-PC  (Version  17). 


Results 

Program  Length  Effect 

Average ESRM increased with program length. Preliminary analyses of this
association compared linear, quadratic, logarithmic, power, and growth models as
functional representations of the relationship of ESRM with program length. The
logarithmic model given as Equation 1 provided the best prediction of average ESRM:

y = .412 + .545 * ln(t),    (1)

where y is the predicted average ESRM and t is program length in weeks. The correlation
of average ESRM with program length was modest (r = .21), but statistically significant
(χ² = 96.88, 1 df, p < .001).

The linear form of the model could produce a mistaken impression. The intercept,
.412, might be mistakenly interpreted as indicating that ESRM was > 0 at the beginning of
training. This would be the usual interpretation of the intercept if Equation 1 were a
simple linear regression of ESRM on weeks of training. The intercept did not have this
interpretation because the predictor had been log-transformed; the intercept is the
predicted ESRM at ln(t) = 0, that is, after 1 week of training. Setting ESRM = 0 in
Equation 1 and solving for t gave 0.47 weeks, or about 3.3 days, as the estimated time at
which the predicted effect first exceeded zero. This estimate would correspond to 1 or 2
training sessions in a typical program.
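
The 3.3-day figure follows directly from Equation 1 and can be checked with a few lines of arithmetic. The coefficients are those reported above; the variable and function names are illustrative only.

```python
from math import exp, log

INTERCEPT, SLOPE = 0.412, 0.545  # coefficients of Equation 1

def predicted_es(weeks):
    """Predicted average ESRM after the given number of training weeks."""
    return INTERCEPT + SLOPE * log(weeks)

# Setting the prediction to zero gives ln(t) = -INTERCEPT / SLOPE.
weeks_at_zero = exp(-INTERCEPT / SLOPE)
days_at_zero = 7 * weeks_at_zero
```

Here weeks_at_zero is roughly 0.47 and days_at_zero roughly 3.3, and predicted_es(1) returns the intercept, .412, because ln(1) = 0.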

Initial  Moderator  Analyses 

The average ESRM was the dependent variable for the initial moderator analyses.
All of the potential moderator effects were statistically significant except for
periodization. The χ² and TLI values in Table 3 showed that adjusting for program length
generally reduced the strength of the moderator effects. The exceptions were the slightly
larger χ² values for periodization and repetitions per set. Despite the general trend toward
weaker effects, adjusting for program length only changed one conclusion regarding the
presence of a moderator effect. Age significantly moderated the average ESRM values, but
was only marginally significant in the analysis of the adjusted average ESRM values.

Table 3


Initial Tests for Moderator Effects

                        Average ESRMᵃ                 Adjusted average ESRMᵇ
Moderator               χ²       dfᶜ   Sig    TLI     χ²       Sig    TLI
Age                     39.39    1     .000   .016    3.73     .054   .000
Gender group            84.40    2     .000   .034    61.96    .000   .025
Men vs. women           49.31    1     .000   .031    18.42    .000   .010
Training status         445.73   1     .000   .225    395.47   .000   .215
Periodization           .83      1     .363   .000    1.02     .313   .000
Sessions per week       58.29    4     .000   .013    40.20    .000   .005
Sets per session        93.85    5     .000   .038    73.27    .000   .025
1 set vs. 3 setsᵈ       50.46    1     .000   .040    38.56    .000   .030
1 set vs. multipleᵈ     66.46    1     .000   .042    58.43    .000   .039
Repetitions per set     88.79    6     .000   .040    92.20    .000   .047

ᵃThe average ESRM was based on each sample’s scores for a subset of the 10 most frequently used strength
tests (see Analysis Procedures). ᵇThe adjusted average ESRM was the average ESRM corrected for program
length based on Equation 1. ᶜThe degrees of freedom were the same for both analyses. ᵈThe single set
comparisons were added to provide direct comparisons to prior reviews.


Training status was an especially strong moderator of ESRM (see Table 3). The χ²
for the training status moderator effect, χ² = 395.47, was larger than the sum of the χ²
values for all of the other moderators, Σχ² = 387.79. Periodization was noteworthy as the
only moderator that was not significant in either analysis.
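
The comparison of the training status χ² with the sum for the other moderators can be reproduced from the adjusted-analysis column of Table 3. The dictionary below simply transcribes those values; the variable names are illustrative.

```python
# Adjusted average ESRM chi-square values transcribed from Table 3.
adjusted_chi2 = {
    "Age": 3.73,
    "Gender group": 61.96,
    "Men vs. women": 18.42,
    "Training status": 395.47,
    "Periodization": 1.02,
    "Sessions per week": 40.20,
    "Sets per session": 73.27,
    "1 set vs. 3 sets": 38.56,
    "1 set vs. multiple": 58.43,
    "Repetitions per set": 92.20,
}

# Sum every entry except training status.
sum_of_others = sum(
    value for name, value in adjusted_chi2.items() if name != "Training status"
)
training_status_dominates = adjusted_chi2["Training status"] > sum_of_others
```

The sum includes the supplementary contrasts (men vs. women and the two single-set comparisons), which is how the reported total of 387.79 is obtained.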

Isolating  Demographic  Moderator  Effects 

The  initial  bivariate  moderator  analyses  were  followed  by  more  focused  analyses 
designed  to  isolate  the  effects  of  specific  demographic  variables.  These  focused  analyses 
were  needed  because  age,  gender,  and  training  status  were  confounded  in  the  data.  For 
example,  all  of  the  samples  with  trained  participants  were  younger,  and  nearly  all  of  them 
were  male.  Given  the  strong  association  of  training  status  with  the  adjusted  average 
ESRM, the confounding of training status with age and gender could bias the assessment
of  age  and  gender  moderator  effects.  To  control  for  this  possible  bias,  the  moderator 
effects  of  age,  gender,  and  training  status  were  re-evaluated  with  the  other  two 
demographic  variables  held  constant.  For  example,  the  analysis  was  limited  to  young  men 
when  the  effects  of  training  status  were  evaluated.  The  restriction  eliminated  any  possible 
effects  of  age  and  gender  on  the  comparison. 

Table  4  presents  the  results  that  were  obtained  when  appropriate  restrictions  were 
introduced  to  isolate  the  effects  of  gender,  training  status,  and  age: 

Gender. Gender did not affect the training response for younger untrained
individuals (χ² = 1.06, p < .303). Older men produced a significantly stronger
training response than older women (χ² = 104.31, 1 df, p < .001).

Table 4

Isolated Demographic Moderator Effects

Gender               Men         k     Women      k     χ²       Sig.
Younger untrained    2.23        73    2.35       32    1.06     .303
Older                3.45        24    1.95       16    104.31   .000

Training status      Untrained   k     Trained    k     χ²       Sig.
Young men            2.23        73    1.01       76    245.39   .000

Age                  Younger     k     Older      k     χ²       Sig.
Untrained men        2.25        73    3.45       24    98.04    .000
Untrained women      2.35        32    1.95       16    9.26     .003

Note. The adjusted average ESRM was the dependent variable for the analyses reported in this table.


Training status. The training response was much larger in samples of untrained
individuals than trained individuals (χ² = 245.39, 1 df, p < .001).

Age. Older men produced significantly larger training effects than younger men
(χ² = 98.04, 1 df, p < .001). Older women produced a significantly weaker training
effect than younger women (χ² = 9.26, 1 df, p < .003).

The  exceptionally  large  adjusted  average  ESRM  for  older  men  was  central  to  both 
the  gender  and  age  moderator  effects.  If  this  group  had  been  ignored,  there  would  have 
been  no  substantial  effects  for  either  gender  or  age.  If  further  study  were  to  show  that  the 
available  data  overestimate  the  true  response  of  older  men,  it  would  be  appropriate  to 
conclude that neither age nor gender is an important training response moderator.

Interaction  of  Population  with  Program  Design  Facets 

Training  programs  often  are  designed  for  a  specific  population.  For  this  reason,  it 
was  important  to  ask  whether  program  design  facets  had  the  same  effect  in  different 
populations.  This  question  was  pursued  with  training  status  and  gender  as  the  possible 
program  design  effect  moderators.  Age  was  not  considered  in  connection  with  this 
question  because  it  was  not  clear  whether  analyses  that  included  the  evidence  from 
samples of older men would yield meaningful results. Separate analyses for men and women
were  not  conducted  because  the  sample  sizes  would  have  been  too  small  to  have 
confidence  in  the  results. 

Weighted  ANOVAs  assessed  the  combined  effects  of  demographic  variables 
(e.g.,  gender)  and  program  elements  (e.g.,  number  of  sets  per  session)  with  the  remaining 
demographic  variables  held  constant.  For  example,  Table  5  presents  the  results  from 
analyses that were restricted to young men.

Table 5


Younger Untrained Males versus Younger Trained Males

                  Experience      Design          ExD Interaction   ExD     Residual
                  χ²       df     χ²       df     χ²       df       TLI     χ²     df
Periodization     176.52   1      4.20     1      11.24    1        .028           142
Sessions          143.04   1      14.59    4      12.71    1        .031           136
Sets              69.96    1      6.29     5      16.69    2        .037           68
1 vs. 3 sets      149.44   1      1.11     1      1.78     1        .000           46
1 vs. multiple    164.70   1      1.58     1      8.69     1        .020           73
Repetitions       183.21   1      20.33    6      1.58     2        .000           62

Note. The dependent variable was the adjusted average ESRM. The TLI values are partial TLIs
based on the χ² values for the interaction and the residual.


The  interactions  of  demographic  characteristics  with  design  facets  were  the 
primary  concern  in  the  analyses  reported  in  Tables  5  and  6.  A  significant  interaction 
indicated  that  the  program  facet  impact  depended  on  which  population  was  considered.  A 
significant interaction would be reason to consider providing population-specific
training  recommendations.  Thus,  the  TLI  for  each  interaction  has  been  reported  to  ensure 
that  the  differences  involved  were  large  enough  to  be  important. 

Training status. For young men, the interaction of training status with program
design was statistically significant for every element except the number of sets per
session (see Table 5). The specific contrast of 1 set per session with 3 sets per session
also failed to reach statistical significance (χ² = 1.78, 1 df, p > .182), but the overall effect
of sets per session was significant (χ² = 16.69, 2 df, p < .001), as was the contrast of 1 set
with >1 set per session (χ² = 8.69, 1 df, p < .003). In each case, program design affected
training outcomes for untrained individuals more than it affected the training outcomes
for trained individuals.


Table  6 

Gender Holding Age and Experience Constant

                  Gender          Design          GxD Interaction   GxD     Residual
                  χ²       df     χ²       df     χ²       df       TLI     χ²       df
Periodized        .10      1      9.63     1      .00      1        .000    454.47   98
Sessions          4.55     1      35.86    3      .37      1        .000    419.34   94
Sets              .02      1      11.58    5      1.39     1        .000    371.16   69
1 vs. 3 sets      .31      1      .77      1      1.20     1        .000    332.41   56
1 vs. multiple    .20      1      .00      1      1.58     1        .000    382.50   74
Repetitions       .22      1      14.66    5      3.88     4        .000    304.18   58

Note. Analyses compared young untrained men and young untrained women. The dependent variable
was the adjusted average ESRM.

Gender. The gender analysis compared young untrained men and women. No
gender  interaction  was  statistically  significant  (see  Table  6).  In  every  case,  the  variation 
explained  by  the  interaction  was  not  even  as  large  as  would  have  been  expected  by 
chance  (i.e.,  TLI  =  .000).  It  was  also  worth  noting  that  gender  produced  a  statistically 
significant  difference  in  only  one  of  six  analyses  (sessions  per  week). 


Population-Specific  Adjustments  for  Program  Length 

Population-specific adjustment equations. Population-specific program length
adjustments were explored because the weighted average adjusted ESRM of trained
individuals was less than half that of untrained individuals (trained, adjusted average
ESRM = .97; untrained, adjusted average ESRM = 2.24). The two populations underwent
training programs of approximately equal length (trained, 10.84 weeks; untrained, 10.28
weeks; t(222) = 1.10, p = .272), so the difference in the average training effect sizes
implied different growth rates for training effects.

The relationship of program length, expressed in weeks, to ESRM depended on
training status. The regression lines were not parallel in an analysis of covariance (χ² =
13.51, 1 df, p < .001). The population-specific regression equation for untrained
individuals was:

ESRM = .111 + .781 * ln(length).    (2)

The relationship was highly significant (χ² = 57.48, 1 df, p < .001, r = .29) in this
population. The corresponding regression equation for trained individuals was:

ESRM = .485 + .189 * ln(length).    (3)

The relationship was not statistically significant (χ² = 1.21, 1 df, p > .271, r = .07) for
trained individuals. Solving the equations for ESRM = .00 indicated that positive training
effects would be predicted after 6 days for untrained individuals and after one day for
trained individuals. The former value implies that 3 or 4 training sessions are needed
before the ESRM exceeds 0; the latter value implies gains from the first training session
onward.
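
The zero-crossing times and the diverging growth rates implied by Equations 2 and 3 can be checked numerically. The coefficients are those reported above; the function names and the 10-week comparison point are illustrative choices, not part of the original analysis.

```python
from math import exp, log

# (intercept, slope) pairs for the population-specific equations.
EQUATIONS = {
    "untrained": (0.111, 0.781),  # Equation 2
    "trained": (0.485, 0.189),    # Equation 3
}

def predicted_es(population, weeks):
    """Predicted ESRM for a population after the given number of weeks."""
    intercept, slope = EQUATIONS[population]
    return intercept + slope * log(weeks)

def days_until_positive(population):
    """Days of training at which the predicted ESRM crosses zero."""
    intercept, slope = EQUATIONS[population]
    return 7 * exp(-intercept / slope)

untrained_days = days_until_positive("untrained")
trained_days = days_until_positive("trained")
untrained_10wk = predicted_es("untrained", 10)
trained_10wk = predicted_es("trained", 10)
```

The untrained crossing falls at about 6 days and the trained crossing at a little over half a day, so it lies within the first day. At 10 weeks the predicted effect for untrained samples is roughly twice that for trained samples.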

The smaller coefficient for ln(length) in Equation 3 for trained individuals
indicated a slower rate of improvement for that population. The association was not
statistically significant, but it was consistent with everyday observations that even trained
populations show strength gains over time. For this reason, population-specific adjusted
ESRM variables were computed for both trained and untrained samples. In the following
results, average ESRM estimates that have been adjusted using Equation 2 or Equation 3
are referred to as population-adjusted average ESRM to distinguish them from the earlier
adjusted average ESRM estimates based on Equation 1.

Population-specific moderator effects for untrained individuals. The earlier
assessment of training status as a moderator was limited to young men. For the analyses
reported in Table 7, samples were included regardless of gender composition because the
population-specific moderator analyses had shown that gender did not moderate the
training response for the adjusted average ESRM.

Table 7


Design Facet Moderator Effects for Young Untrained Individuals

Moderator     Level    Population-adjusted avg.ᵃ    kᵇ     Equivalence setᶜ
Periodized    No       2.34                         100    {No}
              Yes      2.01                         24
              χ² = 9.85, 1 df, p < .001, TLI = .009
Sessions      1        0.99                         1      {3}
              2        1.96                         26
              3        2.42                         93
              4        1.67                         1
              5        2.29                         1
              χ² = 36.51, 4 df, p < .001, TLI = .023
Sets          1        1.96                         18     {3, 2, 6, 4}
              2        2.39                         6
              3        2.49                         56
              4        2.38                         10
              5        1.76                         3
              6        2.38                         4
              χ² = 24.87, 5 df, p < .001, TLI = .000
Intensity     <60%     2.49                                {85%, <60%}
              65%      1.58
              70%      2.57
              75%      2.45
              80%      2.01
              85%      2.79
              χ² = 39.31, 5 df, p < .001, TLI = .071

ᵇ... provided averages for analysis. ᶜThe equivalence sets include all design options that were not significantly
different from the option with the highest population-adjusted average. The design options have been listed
from largest to smallest population-adjusted average ESRM in the set.


All  four  program  design  facets  proved  to  be  statistically  significant  moderators  of 
the population-adjusted average ESRM training effect for untrained program participants
(see  Table  7). 

Periodization.  Periodized  programs  were  less  effective  than  simple  progressive 
programs,  but  the  difference — while  statistically  significant — was  too  small  to  be 
important  (TLI  =  .009).  Non-periodized  programs  were  singled  out  as  a  best  practice,  but 
the small TLI made this a dubious designation.

Sessions per week. The post hoc comparison was limited to contrasting 2 sessions
per week with 3 sessions per week. No comparisons were made for 1, 4, or 5 sessions per
week because each of these options was represented by a single sample. Therefore, while
the post hoc comparisons identified 3 sessions per week as a best practice for this
population, this designation should not be given too much credence until there is more
evidence regarding other options, especially 4 and 5 sessions per week.

Sets per session. The equivalence set included four design options. Three sets per
session produced the largest population-adjusted average ESRM, but the mean effect of 3
sets per session was not significantly greater than the mean effects seen with 2, 4, or 6 sets
per session. Also, 1 set per session was significantly less effective than other choices,
whether the comparison was to all other options (χ² = 22.00, 1 df, p < .001) or to the
specific choice of 3 sets per session (χ² = 19.24, 1 df, p < .001).

Intensity. An 85% target intensity produced the largest population-adjusted
average effect. The post hoc comparisons added 75% or <60% to the equivalence set. The
differences were statistically significant even when the analysis was limited to the three
intensities with >3 population-adjusted average ESRM values (χ² = 32.37, 2 df, p < .001).

Moderator effects for trained individuals. The population-adjusted average ESRM
based  on  Equation  3  was  the  dependent  variable  for  the  analyses  reported  in  Table  8. 

Periodization. Periodized programs produced slightly larger effects than
non-periodized programs, but the difference was not statistically significant.

Sessions  per  week.  The  initial  moderator  test  was  not  statistically  significant,  so 
the  equivalence  set  included  all  four  options. 

Sets per session. Four sets per session produced the largest training effect. Post
hoc comparisons indicated that the effect of 4 sets per session was not significantly
greater than the effect of 5 sets per session. The difference between 1 and 3 sets per
session approached statistical significance (χ² = 2.65, 1 df, p < .104). The difference
between 1 set and >1 set was statistically significant (χ² = 7.07, 1 df, p < .001).

Intensity.  A  target  intensity  of  60%  produced  the  largest  effect,  but  it  was  not 
significantly  greater  than  the  effect  for  80%  or  85%.  Note  that  if  the  effect  for  60%  is 
ignored,  the  trends  for  intensity  followed  the  general  rule  of  thumb  that  “more  is  better.” 
The population-adjusted average ESRM increased from 75% to 85%, but those three
options did not differ significantly (p > .094). Thus, the evidence would support an
equivalence set consisting of three options, even if the 60% programs were excluded
from consideration.

Table 8

Design Facet Moderator Effects for Young Trained Individuals

Moderator     Level    Population-adjusted avg.ᵃ    kᵇ     Equivalence setᶜ
Periodized    No       .84                          33     {Yes, No}
              Yes      1.01                         56
              χ² = 2.82, 1 df, p > .093, TLI = .012
Sessions      1        .74                          5      {3, 4, 2, 1}
              2        .92                          55
              3        1.00                         25
              4        .93                          5
              χ² = 1.41, 3 df, p > .703, TLI = .000
Sets          1        .46                          7      {4, 5}
              3        .78                          16
              4        1.73                         2
              5        1.03                         7
              χ² = 15.72, 3 df, p < .002, TLI = .183
Intensity     60%      1.41                         6      {60%, 85%}
              75%      .58                          13
              80%      .78                          3
              85%      1.04                         8
              χ² = 15.95, 3 df, p < .002, TLI = .136

ᵇ... that provided averages for analysis. ᶜThe equivalence sets include all design options that were not
significantly different from the option with the highest population-adjusted average. The design options
are listed from largest to smallest population-adjusted ESRM in the set.


Publication  Bias 

Separate funnel plots (Light & Pillemer, 1984) were constructed for the adjusted
average ESRM for young untrained, young trained, and older program participants. The
plots appeared to be truncated on the left-hand side because even small studies always
produced positive effects. Symmetry would have required some small to moderate
negative effects. The regression method developed by Egger et al. (1997) provided
statistical confirmation of the impression derived from the funnel plots. The intercept of
the regression was significantly (p < .001) greater than zero in each population.

Discussion

This  review  did  not  identify  best  practices.  Program  design  facets  were 
statistically  significant  moderators  of  ESRM,  but  post  hoc  analyses  seldom  singled  out  one 
option  as  significantly  better  than  all  others.  Only  one  case  involving  more  than  two 
options  produced  a  best  practice.  Four  sets  per  session  was  the  best  option  for  this  design 
facet  for  trained  individuals.  Rhea  et  al.  (2003)  reached  the  same  conclusion,  but  it  should 
be  noted  that  the  estimated  effect  of  4  sets  per  session  has  been  based  on  very  little 
evidence.  In  the  present  review,  the  estimate  was  based  on  evidence  from  just  two 
samples.  Even  one  or  two  additional  studies  that  produced  smaller  effects  for  this  option 
could  alter  the  conclusion  that  4  sets  per  session  is  a  best  practice. 

It  was  also  true  that  non-periodized  training  programs  were  more  effective  than 
periodized  programs  for  untrained  individuals.  Non-periodized  training  programs 
technically  would  be  a  best  practice,  but  the  TLI  was  too  small  to  make  this  a  strong 
recommendation.  Such  guidance  would  not  really  qualify  as  designating  a  best  practice  in 
any  case.  The  design  of  a  non-periodized  program  requires  decisions  about  the  number  of 
sessions,  sets  per  session,  and  repetitions  per  set.  Best  practices  could  not  be  identified 
for  those  program  design  facets,  so  knowing  that  non-periodized  programs  were  better 
than  periodized  programs  would  not  lead  to  the  specification  of  any  optimal  design 
specifics. 

Failing  to  identify  best  practices  is  not  unique  to  this  review.  Other  reviews  show 
the  same  general  pattern  of  results.  Program  design  facets  often  are  statistically 
significant  moderators  of  the  training  response,  but  post  hoc  analyses  fail  to  identify  any 
single  option  as  significantly  better  than  all  other  options  (see  Appendix  E).  Given  this 
general  trend,  the  current  findings  could  not  be  dismissed  as  resulting  from  the  inclusion 
criteria  or  analysis  procedures  that  have  been  employed  in  the  current  review. 

The  statistical  methods  adopted  in  this  review  should  have  sharpened  the  contrasts 
between  design  options.  First,  repeated  measures  analyses  are  expected  to  produce  larger 
effect  sizes.  This  effect  was  present  in  the  current  analyses  as  indicated  by  comparing  the 
average  ESRM  of  1.76  with  the  average  simple  ES  of  1.05.  The  difference  arose  because 
the  estimated  sampling  variability  was  smaller  for  repeated  measures  analyses.  The 
smaller sampling variability also would amplify differences between the average ESRM
values  for  different  design  options  in  post  hoc  analyses,  so  the  current  procedure  should 
have  increased  the  likelihood  of  finding  a  best  practice.  Second,  if  the  results  obtained 
with  different  strength  tests  all  are  estimates  of  the  same  training  effect,  the  use  of 
average ESRM measures increased the precision of the effect size estimates. Increasing
precision  is  the  same  as  increasing  measurement  reliability  (American  Psychological 
Association,  1985),  thereby  reducing  the  effects  of  attenuation  due  to  measurement  error 
(Nunnally  &  Bernstein,  1994).  Once  again,  this  should  increase  the  likelihood  that  the 
post  hoc  comparisons  would  be  statistically  significant.  Finally,  adjusting  for  program 
length  differences  removed  a  source  of  variance  that  otherwise  could  have  obscured 
differences  between  treatment  options. 

The failure to identify best practices does not mean that such practices do not
exist. Every analysis produced one option that had a larger average ESRM than all other
options for that facet. The problem was that the differences between the most promising
option and other choices were not large enough to be statistically significant. Although
the comparisons have not been reported in detail here, many post hoc comparisons
produced very small χ² values despite moderately large sample sizes. The implication is
that the available evidence would have to be multiplied many times to make the contrasts
between the design options statistically significant. If the required data were available,
the conclusion still might be that the differences were too small to be important. That
argument would be particularly powerful when coupled with the knowledge that training
effects increase with program length. In the final analysis, the difference between design
options might amount to choosing between one program that would produce a given
effect in 10 weeks and another program that would require 12 weeks to produce the same
results. It is debatable whether the extensive additional research that would be needed to
clearly define best options would really have much impact on program design choices.

A  low  probability  of  identifying  best  practices  at  any  time  in  the  near  future  does 
not  mean  that  resistance  training  research  has  no  value.  Research  has  singled  out  some 
design options as less effective than others. Although the typical equivalence set included
more than one option, it seldom contained all possible options. For example,
the overall evidence justified ruling out the use of single-set programs except in the
presence of design constraints that make it impossible to incorporate more than one set per
session. This point has been debated in the past, but the aggregate body of evidence
summarized  here  and  in  other  recent  reviews  has  reached  the  point  that  there  is  little 
doubt  that  multiple-set  programs  are  superior  to  single-set  programs.  This  result  suggests 
a  guideline  for  future  studies.  It  may  be  more  productive  to  undertake  studies  designed  to 
rule  out  some  options  than  to  focus  attention  on  identifying  the  best  option.  Here  again, 
though,  the  ultimate  difference  may  be  that  the  less  effective  options  simply  take  longer 
to  reach  program  targets. 

Research  to  date  has  identified  only  one  strong  influence  on  the  training  response. 
Untrained  individuals  have  produced  much  greater  training  effects  than  trained 
individuals (Rhea et al., 2003; Payne et al., 1997; Rhea et al., 2002; Wolfe et al., 2004).
Gender  has  had  little  impact  on  the  training  response — a  finding  that  corroborated  earlier 
reviews  that  used  different  inclusion  criteria  and  analytic  methods  (Payne  et  al.,  1997; 
Peterson  et  al.,  2004;  Rhea  et  al.,  2003;  Rhea  &  Alderman,  2004;  Wolfe  et  al.,  2004). 
Whether  age  affects  the  training  response  is  uncertain.  The  age  effects  for  men  and 
women  were  in  the  opposite  direction,  but  the  possibility  cannot  be  ruled  out  that  this 
result  was  due  to  a  few  atypical  male  samples. 

The  smaller  average  response  of  trained  individuals  deserves  further  comment. 
Anecdotal  evidence  and  the  flatter  slope  of  the  program  length-ESRM  equation  for  trained 
individuals  both  suggest  a  slower  rate  of  improvement  for  trained  individuals.  The 
smaller  average  response  of  trained  individuals  has  the  same  implication  because  the 
average  program  length  was  virtually  identical  for  trained  and  untrained  samples.  This 
rate  difference  has  implications  for  evaluating  training  programs  that  have  been  ignored 
to  date.  Training  programs’  effects  accumulate  over  time,  so  research  that  compares 
different  programs  must  continue  long  enough  to  show  the  differences  in  the  cumulative 
training  effects.  Longer  studies  are  needed  to  accurately  evaluate  different  program 
designs  in  trained  populations. 

This review has provided additional perspective on some ongoing resistance
training controversies. The value of single-set programs has been debated, with some
reviewers favoring the conclusion that single-set programs are as good as multiple-set
programs (Carpinelli & Otto, 1998), while others favor the conclusion that multiple-set
programs are more effective (Galvao & Taaffe, 2004; Rhea et al., 2002; Wolfe et al.,
2004). This summary of the evidence came down in favor of multiple-set programs.

This  review  has  also  provided  additional  perspective  on  periodized  training 
programs.  Periodized  programs  were  slightly  less  effective  than  non-periodized  programs 
for  untrained  individuals.  The  two  approaches  produced  comparable  results  in  trained 
individuals. These results appear to conflict with Rhea and Alderman’s (2004) conclusion
that  periodized  programs  are  effective  regardless  of  training  status,  age,  or  gender.  The 
key  to  the  apparent  conflict  may  be  that  there  is  a  difference  between  knowing  that  a 
program  is  effective  and  knowing  that  it  is  more  effective  than  some  alternative  program. 
Rhea and Alderman (2004) analyzed their data in two phases. First, they showed that
periodized  programs  were  more  effective  than  non-periodized  programs  when  all  of  their 
effect  sizes  were  analyzed  together.  Subsequently,  Rhea  and  Alderman  (2004)  showed 
that  periodized  programs  produced  statistically  significant  gains  for  men  and  women, 
young  and  old,  and  trained  and  untrained  individuals.  The  results  of  the  second  phase  of 
the  analyses  only  lead  to  the  conclusion  that  periodized  programs  are  superior  to  non- 
periodized  programs  for  all  types  of  people  if  the  difference  observed  in  the  initial 
analysis  applies  equally  to  all  subgroups.  The  present  analyses  suggested  that  this 
generalization  cannot  be  taken  for  granted.  Indeed,  the  findings  in  this  review  raised 
doubts  about  the  overall  superiority  of  periodized  programs.  Further  investigation  of  this 
topic  could  be  fruitful,  but  it  is  worth  noting  that  identifying  a  difference  between 
periodized  programs  and  non-periodized  programs  will  not  simplify  the  problem  of 
defining best practices. Appropriate choices for the number of
sessions per week, sets per session, and repetitions per set will have to be made no matter
which approach is ultimately identified as the better option. Evidence favoring
periodization  would  amplify  these  requirements  by  making  it  necessary  to  specify  a 
choice  for  each  facet  in  each  microcycle,  specify  the  duration  of  each  microcycle,  and  so 
forth. 

Limitations  of  this  review  must  be  considered  when  evaluating  the  findings.  No 
attempt  was  made  to  conduct  ancestry  searches  or  to  systematically  search  for 
unpublished  studies,  so  it  is  likely  that  the  total  volume  of  evidence  could  have  been 
increased.  However,  unless  the  omitted  literature  is  very  large,  additional  evidence  would 
be very unlikely to change this review’s primary conclusions. Furthermore, some
program  design  options  have  been  studied  so  infrequently  (e.g.,  4  sessions  per  week  for 
untrained  individuals)  that  it  is  unlikely  that  any  search  would  yield  enough  evidence  to 
reach  strong  conclusions  about  the  utility  of  those  options.  A  third  important  point  is  that 
this review relied on estimates of the correlation of pre-training test scores with
post-training test scores. Estimates had to be used because training studies rarely report this
statistic  or  other  statistics  from  which  it  could  be  derived  (e.g.,  a  sample-specific 
correlated  t  test).  Another  point  to  consider  is  that  the  analyses  relied  on  a  fixed-effects 
statistical model. Large residual χ² values indicated that adopting a random-effects model
would  have  been  a  reasonable  course  of  action.  However,  shifting  to  a  random-effects 
model  would  only  have  accentuated  the  central  finding  of  the  study.  The  fixed-effects 
model  overestimates  the  estimated  effect  size  precision  (National  Research  Council, 
1992).  Shifting  from  a  fixed-effects  model  to  a  random-effects  model  would  have 
increased  the  variance  estimate  for  individual  effect  sizes.  The  increased  variance  would 
have reduced the χ² values in the analyses. Any difference between design options that
was  not  statistically  significant  in  the  fixed-effects  analysis  would  be  even  less  likely  to 
be  significant  in  the  random-effects  model.  The  use  of  a  random-effects  model  would 
have  been  expected  to  increase  the  size  of  the  equivalence  sets  that  were  the  central 
products  of  this  review.  A  fifth  issue  to  consider  is  that  no  attempt  was  made  to  correct 
for  publication  bias.  Corrections  would  have  resulted  in  a  somewhat  smaller  overall 
effect  size  estimate,  but  the  tests  for  publication  bias  may  have  been  misleading  because 
bias  is  not  the  only  process  that  could  have  generated  the  same  pattern  of  findings  (Tang 
&  Liu,  2000).  Finally,  the  interactions  between  program  design  facets  were  not 
investigated.  A  meaningful  evaluation  of  those  interactions  would  require  data  from 
studies  that  represented  a  wide  range  of  the  possible  design  facet  combinations.  The  fact 
that  some  individual  design  options  have  only  been  studied  in  a  single  sample  made  it 
clear  that  research  to  date  has  not  provided  the  empirical  evidence  needed  to  fully 
evaluate  interactions. 

What  guidelines  are  appropriate  for  designing  resistance  training  programs?  The 
present  failure  to  identify  best  practices  left  this  question  unanswered,  but  the  findings 
provide  a  frame  of  reference  for  an  answer.  The  key  observations  are  that  it  will  usually 
be  the  case  that  several  design  options  will  produce  similar  effects  and  that  training  status 
is  important.  It  follows  that  guidelines  should  present  a  range  of  alternatives  for  the 
number  of  sessions  per  week,  the  number  of  sets  per  session,  and  the  intensity  of 
repetitions  within  sets.  The  choices  among  alternatives  should  be  tailored  to  the  training 
status  of  the  program  participants.  Guidelines  that  meet  these  criteria  and  embody 
informed  professional  judgments  based  on  the  empirical  evidence  are  already  available 
(American College of Sports Medicine, 2009; Kraemer et al., 2002). The best
recommendation  for  program  design  at  this  time  is  to  follow  those  guidelines.  It  is 
unlikely  that  better  guidelines  will  be  available  in  the  near  future  given  the  difficulty  of 
providing  convincing  evidence  that  options  within  equivalence  sets  produce 
demonstrably  different  training  outcomes. 
Appendix A

Computing  Effect  Sizes  for  Repeated  Measures 

Meta-analysis  provides  estimates  of  the  average  ES  and  the  variation  of  individual  ES 
estimates  about  that  average.  The  homogeneity  tests  for  variation  about  the  average  are 
especially  important  in  the  present  context.  If  the  ESs  for  different  training  programs 
display greater-than-chance variation, it is reasonable to search for moderator variables
that  can  explain  the  observed  heterogeneity.  In  the  present  review,  program  design  facets 
and  demographic  variables  were  of  interest  as  potential  moderator  variables. 

Studies  must  be  assigned  appropriate  weights  to  compute  the  average  ES  and  test 
for  variation  about  the  average.  The  weights  are  based  on  the  precision  of  the  individual 
ES  estimates.  All  studies  reviewed  here  employed  pretest-posttest  research  designs.  In 
such cases, the correlation of pretest scores with posttest scores affects the sampling
variance that serves as the index of precision for the ES estimate. Therefore, the pretest-posttest
correlation  must  be  known  to  derive  sampling  variance  estimates  that  are  suitable  for 
determining  ES  weights.  The  correlation  must  be  known  whether  the  analyses  employ 
standardized  mean  change  scores  or  difference  scores  (Morris,  2000).  For  change  scores, 
the  proper  estimate  of  sample  variance  is: 

\sigma^2_{Diff} = \sigma^2_{pre} + \sigma^2_{post} - 2 r \sigma_{pre} \sigma_{post}    (A1)

In this equation, the subscripted “Diff” indicates that the variable of interest is a
difference  score.  The  pretest-posttest  correlation,  r,  is  expected  to  be  positive  and 
moderate  to  large.  As  a  consequence,  the  last  term  of  Equation  A1  will  be  moderate  to 
large  relative  to  the  first  two  terms.  It  follows  that  simply  pooling  the  pretest  and  posttest 
variances,  as  would  be  the  case  if  the  pretest-posttest  correlation  was  ignored,  will  result 
in  overestimation  of  the  true  sampling  variance.  If  the  variance  is  overestimated,  the  z 
scores  associated  with  the  deviation  of  specific  ES  values  from  the  average  ES  will  be 
smaller than they would be if the correct variance were used. The overall test for
homogeneity  of  ESs,  Cochran’s  Q,  is  the  sum  of  the  squared  z  scores.  Thus, 
overestimating  sampling  variance  will  lead  to  underestimating  Q.  This  bias  in  the  Q  test 
values  could  lead  to  the  erroneous  conclusion  that  a  given  moderator  is  unimportant.  The 
tests  for  moderators  were  central  to  this  review,  so  accurate  variance  estimates  were 
essential. 
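
The size of the overestimation can be illustrated with a short Python sketch. It assumes that Equation A1 is the standard variance of a difference between two correlated scores (pretest variance plus posttest variance minus twice the product of r and the two standard deviations). The function name and the standard deviations are hypothetical values chosen for illustration, not data from any study in this review.

```python
def diff_score_variance(sd_pre, sd_post, r):
    """Variance of posttest-minus-pretest difference scores.

    Assumed form of Equation A1:
    var_diff = sd_pre^2 + sd_post^2 - 2 * r * sd_pre * sd_post
    """
    return sd_pre ** 2 + sd_post ** 2 - 2.0 * r * sd_pre * sd_post


# Hypothetical pretest and posttest standard deviations (e.g., kg lifted).
sd_pre, sd_post = 10.0, 11.0

# Variance when the pretest-posttest correlation is taken into account.
var_with_r = diff_score_variance(sd_pre, sd_post, r=0.80)

# Variance when the correlation is ignored (equivalent to setting r = 0).
var_ignoring_r = diff_score_variance(sd_pre, sd_post, r=0.0)

print(round(var_with_r, 2))      # 45.0
print(round(var_ignoring_r, 2))  # 221.0
```

With r = .80, the variance of the difference scores in this example is roughly one fifth of the value obtained when the correlation is ignored, which is why ignoring the correlation shrinks the z scores and Q.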

The  correct  variance  estimates  could  be  estimated  easily  if  studies  routinely 
reported  the  pre-training/post-training  correlations  for  test  scores.  Unfortunately,  this 
information is seldom reported. The required information could be extracted from the t
tests or F tests for the time effect if either statistic was reported separately for each
condition in the study. Once again, resistance training studies seldom provide this
information.

Similar  problems  have  been  encountered  in  other  meta-analytic  contexts.  An 
analogous  problem  arises  when  research  syntheses  must  adjust  for  measurement  error 
(Hunter  &  Schmidt,  1990).  Because  reliability  estimates  are  reported  only  infrequently 
for  the  primary  studies  in  an  analysis,  reliability  estimates  are  obtained  by  averaging 
those  that  have  been  reported  (e.g.,  Safrit,  Hooper,  Ehlert,  Costa,  &  Patterson,  1988).  The 
same approach can be applied to the problem of estimating the pretest-posttest
correlations  for  resistance  training  studies. 

Table  A1  presents  average  pretest-posttest  correlations  for  strength  tests 
administered  in  several  resistance  training  studies.  The  averages  are  derived  from  the 
information in studies by Hagerman (2001), Karabulut (2008), Konstanty (1990), Lucier
(1999),  Omizo  (1992),  Van  Oosbree  (1993),  and  Womer  (2003).  These  studies  reported 
the  correlation,  the  t  test  or  F  test  for  the  time  effect,  or  the  raw  data  needed  to  compute 
these  statistics.  The  averages  also  included  the  results  of  analyzing  data  from  several 
studies  of  Navy  personnel  (Marcinik,  Hodgdon,  Englund,  &  O’Brien,  1987;  Marcinik, 
Hodgdon,  &  Vickers,  1985;  Marcinik,  1986;  Marcinik,  unpublished  manuscript).  Table 
A1  reports  the  resulting  weighted  averages  for  6  strength  tests  that  were  represented  by  at 
least  three  estimates  of  the  test-retest  correlation.  The  cumulative  sample  sizes  for  those 
tests  ranged  from  290  to  459  observations. 

Table A1

Test-by-Test Estimates of the Pooled Pre-Post Correlation

Test                k      ΣN    Mean r
Bench              10     497     .909
Shoulder press      4     344     .802
Lat pull-down       3     290     .826
Biceps curl         3     333     .770
Leg press           6     459     .817
Knee extension      4     112     .723
Total                    2120     .834

A  meta-analysis  of  the  differences  in  the  correlation  coefficients  for  the  various 
strength tests was conducted using the Fisher r-to-Z transformation. The analysis showed
that the reliability estimates for the six tests in Table A1 differed significantly (χ² =
71.89, 5 df, p < .001). However, most of the variation was attributable to the difference
between the average correlation for bench press and the corresponding statistics for the
remaining five tests. When the bench press was eliminated from the analysis, the
remaining differences were not statistically significant (χ² = 8.04, 4 df, p > .090). The
average  test-retest  correlation  in  this  subset  of  the  strength  tests  was  r  =  .800,  a  value  that 
was  representative  of  all  five  tests.  Based  on  these  findings,  repeated-measures  effect  size 
computations  proceeded  with  the  test-retest  correlation  for  the  bench  press  set  at  r  =  .90, 
and  the  test-retest  correlation  estimate  for  other  strength  measures  set  at  r  =  .80. 
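
A homogeneity test of this kind can be sketched in Python as follows. The sketch assumes the common fixed-effects approach in which each correlation is converted to Fisher's Z and weighted by n - 3; the report does not describe its exact weighting, so this is not necessarily the computation behind the values above. The function name, correlations, and sample sizes are hypothetical.

```python
import math


def fisher_z_homogeneity(correlations, sample_sizes):
    """Homogeneity test for independent correlations via Fisher's Z.

    Returns (Q, df, pooled_r). Q is referred to a chi-square
    distribution with df = number of correlations - 1.
    """
    # Fisher r-to-Z transformation: z = atanh(r).
    zs = [math.atanh(r) for r in correlations]
    # The sampling variance of z is 1 / (n - 3), so the weight is n - 3.
    weights = [n - 3 for n in sample_sizes]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled z.
    q = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    # Back-transform the pooled z to the correlation metric.
    return q, len(zs) - 1, math.tanh(z_bar)


# Hypothetical correlations and sample sizes for three strength tests.
q, df, pooled_r = fisher_z_homogeneity([0.90, 0.80, 0.82], [60, 50, 40])
print(f"Q = {q:.2f} on {df} df, pooled r = {pooled_r:.3f}")
```

A large Q relative to the chi-square distribution on df degrees of freedom indicates that the correlations differ by more than sampling error would predict.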

After  developing  estimates  of  the  pretest-posttest  correlations,  the  analysis 
followed  guidelines  provided  by  Morris  and  DeShon  (2002).  First,  the  variance  for 
individual observations was computed by applying Equation A1 above. Second, the
standard deviation of the differences (SDdiff) was computed by taking the square root of
the variance. This standard deviation was used to compute the initial ESRM (Equation
A2). A separate ES was computed for each record in the data file. A record consisted of
the  results  for  a  single  strength  test  administered  to  a  particular  subject  sample. 

The  use  of  an  average  pretest-posttest  correlation  in  many  cases  will  obviously  be 
inaccurate.  However,  these  correlations  clearly  have  been  positive  and  substantial  when 
estimates  have  been  available.  Ignoring  this  strong  trend  would  lead  to  very  conservative 
tests  for  moderator  effects.  The  uncertainty  introduced  by  the  use  of  average  values  was 
preferable  to  having  results  that  certainly  were  too  conservative. 

The  estimated  pretest-posttest  correlation  values  were  combined  with  the  sample 
standard  deviations  to  compute  the  variance  of  the  difference  scores  as  shown  in 
Equation A1. The standard deviation of the difference (SDdiff) was the square root of this
variance. The ES for repeated measures was:

ES_{RM} = (Mean_{post} - Mean_{pre}) / SD_{Diff}    (A2)

Weighting  ESrm  estimates.  Individual  ES  estimates  must  be  weighted  to  obtain 
the  most  precise  aggregated  ESrm  estimate  and  to  test  for  heterogeneity  in  the  individual 
estimates.  The  appropriate  weights  are  the  inverse  of  the  variance.  The  variance  of  an 
individual  ESrm  estimate  can  be  computed  by  applying  the  equation  for  the  single-group 
pretest-posttest change score variance formula in Table 2 of Morris and DeShon (2002,
p. 117):

Variance = \frac{1}{n} \left( \frac{n-1}{n-3} \right) \left( 1 + n \delta_{RM}^2 \right) - \frac{\delta_{RM}^2}{c^2}    (A3)


In this equation, n is sample size and δRM is the population value for ESRM. The equation
includes a bias correction, c, to obtain accurate variance estimates. This correction factor
was obtained by applying the approximation developed by Hedges (1982) and given as
Equation 23 in Morris and DeShon (2002, p. 117):

c = 1 - \frac{3}{4 \, df - 1}    (A4)


The variance computations required one additional input, δRM. Ideally, this
parameter would be set equal to the unknown population ES. An estimate of this
population parameter, dRM, was used because the population value can only be estimated
after the variance is already known. Given this circularity, the recommended
solution  is  to  compute  the  unweighted  ES  and  use  that  value  for  computing  the  variance 
for  ESrm  (Hedges,  1982;  Morris  &  DeShon,  2002).  The  present  analyses  employed  this 
approach. 

The variance of ESRM was computed by applying Equation A3 after estimating
δRM. The derivation of that equation can be found in the appendix to Morris and DeShon
(2002)  or  in  Gibbons,  Hedeker,  and  Davis  (1993).  The  accuracy  of  the  syntax  used  to 
implement  the  equation  in  the  present  analyses  was  confirmed  by  repeating  the  small 
meta-analysis  given  in  Table  3  of  Morris  and  DeShon  (2002). 
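
The computations described in this appendix can be sketched in Python. The sketch is not the syntax used in the present analyses. It assumes that Equation A3 is the single-group pretest-posttest change score variance, (1/n)((n - 1)/(n - 3))(1 + n δRM²) - δRM²/c², with c = 1 - 3/(4 df - 1) and df = n - 1; these forms should be checked against Table 2 and Equation 23 of Morris and DeShon (2002) before use. All function names and input values are hypothetical.

```python
def bias_correction(df):
    """Hedges' approximation to the bias correction (assumed Equation A4)."""
    return 1.0 - 3.0 / (4.0 * df - 1.0)


def es_rm(mean_pre, mean_post, sd_diff):
    """Repeated-measures effect size (Equation A2)."""
    return (mean_post - mean_pre) / sd_diff


def es_rm_variance(n, delta_rm):
    """Sampling variance of ES_RM (assumed form of Equation A3).

    n is the sample size; delta_rm is the value substituted for the
    population effect size (e.g., the unweighted mean ES_RM).
    """
    df = n - 1  # assumption: df = n - 1 for a single-group pre/post design
    c = bias_correction(df)
    first_term = (1.0 / n) * (df / (n - 3.0)) * (1.0 + n * delta_rm ** 2)
    return first_term - delta_rm ** 2 / c ** 2


# Hypothetical sample: 20 participants, means in kg, SD of differences 7.5.
effect = es_rm(mean_pre=100.0, mean_post=115.0, sd_diff=7.5)

# The unweighted mean ES_RM stands in for the unknown population value.
variance = es_rm_variance(n=20, delta_rm=1.81)
weight = 1.0 / variance  # inverse-variance weight used in the meta-analysis

print(f"ES_RM = {effect:.2f}, variance = {variance:.4f}, weight = {weight:.2f}")
```

Because the same δRM estimate is used for every sample, the weights in this sketch depend only on sample size, which mirrors the role of the unweighted average described above.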




Appendix  B 

Selection  of  Strength  Tests  for  Analysis 

For  every  sample,  variants  of  the  average  ESrm  were  the  dependent  variables  in  the 
analyses  reported  in  this  paper’s  main  body.  This  appendix  describes  the  analyses  that 
were  undertaken  to  determine  whether  averaging  was  appropriate. 

Table B1

Average ESRM for Different Strength Tests

Test                    No. Effects   Cumulative Scores   Mean     SD     SE     Min    Max
Bench Press                     161                2336   1.53   1.15   0.09   -0.28   6.40
Leg Press                        91                1248   1.98   1.48   0.16    0.18   7.07
Squat                            84                1063   1.56   0.95   0.10    0.07   4.22
Knee/Leg Extension               81                1038   2.16   1.27   0.14    0.23   6.09
Biceps Curl                      63                1425   1.61   1.05   0.13    0.27   5.20
Leg Curl                         44                 630   1.71   0.93   0.14    0.09   3.77
Chest Press                      38                 588   2.21   1.95   0.32    0.27   9.70
Triceps/Elbow Ext                32                 434   2.24   1.35   0.24    0.66   5.94
Lat Pull-down                    28                 392   2.27   1.48   0.28    0.76   7.00
Mil/Shoulder Press               20                 252   1.80   1.41   0.32    0.32   5.45
Miscellaneous Rowing             14                 241   2.41   1.50   0.40    0.83   6.46
Miscellaneous Core               13                 168   1.29   0.52   0.14    0.43   2.01
Power Clean                       9                  74   0.23   0.43   0.14   -0.12   1.33
Lateral Raise                     7                 103   1.40   0.33   0.12    0.76   1.67
Calf Raise                        6                  98   2.72   1.37   0.56    1.14   4.89
Hip Abduction                     6                  86   2.54   0.87   0.36    1.41   3.51
Hip Extension                     6                  85   3.11   1.32   0.54    1.75   5.14
Hip Adduction                     5                  75   2.42   1.02   0.46    1.02   3.67
Hip Flexion                       4                  63   3.59   0.66   0.33    2.80   4.33
Chest Fly                         1                   8   2.20    N/A    N/A    2.20   2.20
Total                           713               10407   1.83   1.29   0.05   -0.28   9.70

Initial assessment of test effects on ESRM. The analyses began with an
examination of the unweighted ESRM values for 18 specific strength tests and two
categories of tests that were comparable in general intent, but differed in the specifics of
the exercise (see Table B1). The goal of this analysis phase was to determine whether a
single  average  value  could  be  used  in  the  procedures  described  in  Appendix  A  to 
compute  the  variance  estimates  for  each  individual  test  regardless  of  which  test  was 
considered.  If  the  average  ESRM  for  individual  tests  varied  significantly  across  tests,  the 
use  of  a  test-specific  average  would  be  more  reasonable. 
The initial one-way ANOVA indicated that a single value was not suitable. The average
ESRM for different tests varied from a minimum of 0.23 for power clean to a maximum
of 3.59 for hip flexion. Test differences were statistically significant (F(19, 693) = 3.96, p <
.001) and accounted for 8.1% of the total variation in ESRM (see Table B2).

Table B2

Impact of Strength Test on Effect Size Estimates

                      SS     df     MS      F     Sig     ε²
All tests
  Between         116.69     19   6.14   3.96    .000   .081
  Within         1075.31    693   1.55
Common tests
  Between          52.01      9   5.78   3.60    .000   .037
  Within         1013.32    632   1.60

Note. Common tests were those represented by 20 or more effect sizes in the data. Only four samples that
produced a total of 6 ESs were lost from the analysis because they had not performed any of the common
tests.


The  average  values  for  individual  tests  gave  reason  to  believe  that  a  subset  of  the 
tests  might  have  unduly  inflated  the  overall  differences.  Eight  questionable  tests  were 
represented by <10 ESRM values (see Table B1). Seven of those eight tests produced
extreme  averages,  including  the  smallest  average  ESrm  and  the  six  largest  average  ESrm 
values  in  the  table. 

Another  comparison  shed  further  light  on  the  impact  of  the  number  of  ESrm 
estimates available to describe a given strength test. The average ESRM for tests with ≥20
ESRM values ranged from 1.53 to 2.27. Only one of the eight tests represented by <20
ESRM values produced an average ESRM within this range. The other seven averages were
either less than the lower bound of this range or greater than the upper bound. The test
with  an  average  ESRM  within  the  range,  chest  fly,  was  represented  by  a  single  sample 
comprised  of  eight  subjects.  The  average  ESrm  value  for  the  two  miscellaneous 
categories  also  fell  outside  the  range  of  averages  for  the  frequently  used  tests. 

Focus  on  frequently  used  tests.  The  second  analysis  phase  examined  the 
feasibility  of  limiting  the  coverage  in  this  review  to  a  subset  of  strength  tests.  The 
rationale  for  this  step  was  that  the  extreme  values  observed  for  tests  that  had  been 
administered  to  <20  samples  might  be  misleading  because  there  was  too  little  data  to 
obtain  accurate  estimates  for  those  tests.  Retaining  the  extreme  averages  could  exaggerate 
the  true  magnitude  of  test  differences. 

The  second  analysis  examined  differences  between  the  10  most  common  strength 
tests.  The  test-to-test  variation  in  test  scores  was  significant  even  when  the  analysis  was 
restricted  to  the  10  most  common  tests  (see  Table  B2).  The  proportion  of  variance 
explained  by  the  test  differences  was  reduced  (3.7%  vs.  8.1%),  but  so  were  the  degrees  of 
freedom  associated  with  this  variance  (9  vs.  19).  The  percentage  of  variance  per  degree  of 
freedom  remained  approximately  constant. 

The  ANOVA  results  provided  the  basis  for  assessing  the  variation  in  ESrm.  The 
assessment  used  the  variance  explained  and  the  number  of  degrees  of  freedom  to 
compute an effect size for the test-to-test differences. The differences failed to reach the
minimum  value  that  Cohen  (1988)  would  classify  as  a  small  effect  in  either  the  analysis 
based on 20 tests or the analysis based on 10 tests. However, the ε for the analysis based
on 10 frequently used tests was clearly below the criterion value for a small effect while
the ε for the analysis based on all 20 tests was quite close to the lower bound for a small
effect.  This  difference  focused  subsequent  attention  on  the  10  common  tests. 

At  this  point  in  the  analyses,  it  was  judged  appropriate  to  use  test-specific 
averages  when  computing  ESRM  variance.  Thus,  preliminary  analyses  were  conducted 
with weights based on variances computed with a test-specific dRM for each ESRM. The
results  of  those  preliminary  analyses  were  the  point  of  departure  for  a  third  assessment  of 
the  need  for  test-specific  averages. 

Table B3

Strength Test Differences in Specific Populations

                                  SS     df     MS      F    Sig.     ε²
Young untrained women
  Between                      51.15      9   5.68   3.19    .003   .222
  Within                      106.98     60   1.78
Young untrained men
  Between                      20.14      9   2.24   1.73    .088   .042
  Within                      181.53    140   1.30
Young trained men
  Between                       3.24      8    .40    .46    .882   .000
  Within                      108.72    123    .88
Older
  Between                       8.26      9    .92    .50    .873   .000
  Within                      314.21    171   1.84
Young untrained women
w/o extremes
  Between                      16.85      9   1.87   1.82    .084   .089
  Within                       58.64     57   1.03

Training  population  effects.  A  third  analysis  was  undertaken  after  preliminary 
analyses  using  weights  based  on  test-specific  averages  showed  that  demographic 
variables  were  strongly  related  to  ESRM.  Given  those  strong  effects,  any  demographic 
variable  confounding  with  strength  test  usage  could  distort  the  effects  of  strength  tests  on 
ESRM. To determine whether strength tests produced different average ESRM values when
demographic variables were held constant, the strength test comparisons were repeated in
four homogeneous subgroups: untrained young men, untrained young women, trained
young  men,  and  older  populations.  The  analysis  was  restricted  to  the  10  most  frequently 
used  tests  to  ensure  that  the  test  results  were  comparable  across  populations  (see  Table 
B3). 
Table B4

Reliabilities Based on Intraclass Correlations

No. tests   No. samples   Estimated ICC(a)
    1           117           .635(b)
    2           116           .777
    3            24           .385
    4            15           .731
    5             9           .632
    6             9           .461

Note. The total number of samples in these analyses was only 290 because intraclass correlations were not
computed for samples with seven test scores (k = 4) or eight test scores (k = 4). The sample sizes for those
analyses were judged too small to obtain reasonable estimates of the typical difference between samples.
(a) Intraclass correlation coefficient. (b) The estimated single test ICC was derived by applying the Spearman-
Brown prophecy formula (Nunnally & Bernstein, 1994, pp. 262-264) to the ICC for samples with two tests.


Strength  test  was  not  a  significant  predictor  of  ESrm  in  three  of  four  demographic 
groups.  Using  the  method  of  adding  ps  (Rosenthal,  1978),  the  pooled  probability  for  the 
four groups was p = .484. The data for young, untrained women then were
reanalyzed with three extreme effect sizes (ESRM > 5.80) dropped from the analysis. The
differences  for  this  group  no  longer  were  statistically  significant  (p  =  .084)  and  the 
pooled  probability  increased  (p  =  .575). 
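
The pooled probabilities can be illustrated with a short Python sketch of the simple form of the method of adding ps, in which the sum of the p values is raised to the power N and divided by N factorial. The function name is hypothetical, and the report does not state whether any correction was applied; the simple form is used here because it reproduces both reported values from the significance levels in Table B3. The simple form is usually described as suited to cases in which the summed ps do not greatly exceed 1.

```python
import math


def add_ps_simple(p_values):
    """Pooled probability from the simple 'adding ps' formula.

    pooled p = (sum of p values) ** N / N!, where N is the number
    of independent tests being combined.
    """
    n = len(p_values)
    return sum(p_values) ** n / math.factorial(n)


# Significance levels for the four demographic groups (Table B3).
initial = add_ps_simple([0.003, 0.088, 0.882, 0.873])

# Same groups after the three extreme effect sizes were dropped
# from the young untrained women (p = .084 replaces p = .003).
trimmed = add_ps_simple([0.084, 0.088, 0.882, 0.873])

print(round(initial, 3))  # 0.484
print(round(trimmed, 3))  # 0.575
```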

Further  consideration  of  the  results  from  the  subgroup  analyses  focused  on  the 
variance  explained  by  strength  test  differences.  The  variance  explained  failed  to  meet  the 
minimum  criterion  for  even  a  small  effect  size  for  the  untrained  young  men,  trained 
young  men,  and  older  populations.  The  variance  explained  for  the  untrained  young 
women  exceeded  the  minimum  value  for  a  small  effect  in  the  initial  analyses,  but  fell  just 
below  this  minimum  when  the  three  extreme  samples  were  excluded. 
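
The F ratios and variance-explained values for the subgroups can be recomputed from the sums of squares and degrees of freedom in Table B3. The Python sketch below assumes that the variance-explained statistic is epsilon-squared, (SS between - df between × MS within) / SS total, with negative values set to zero. That identification is an assumption; it reproduces the tabled values for the two untrained groups shown here, but the report does not name the formula. The function name is hypothetical.

```python
def anova_effect(ss_between, df_between, ss_within, df_within):
    """Return (F, epsilon-squared) for a one-way ANOVA summary.

    Epsilon-squared is an adjusted proportion of variance explained;
    it is truncated at zero when the F ratio is below 1.
    """
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    f_ratio = ms_between / ms_within
    ss_total = ss_between + ss_within
    eps_sq = (ss_between - df_between * ms_within) / ss_total
    return f_ratio, max(eps_sq, 0.0)


# Young untrained women (Table B3).
f_women, eps_women = anova_effect(51.15, 9, 106.98, 60)
# Young untrained men (Table B3).
f_men, eps_men = anova_effect(20.14, 9, 181.53, 140)

print(round(f_women, 2), round(eps_women, 3))  # 3.19 0.222
print(round(f_men, 2), round(eps_men, 3))      # 1.73 0.042
```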

The preceding analyses indicated that there was no reason to adopt different dRM
values  for  different  strength  tests  provided  attention  was  limited  to  the  10  most  frequently 
used  tests.  An  average  ESrm  based  on  the  10  frequently  used  tests  was  a  reasonable  index 
of  the  impact  of  a  given  training  program. 

Reliability of the average ESRM. If the results for different strength tests provide
estimates of a common ESRM, the average ESRM should be more reliable than a single
score. Intraclass correlation coefficients were computed to determine the reliability of the
aggregated  score  (see  Table  B4).  The  typical  value  indicated  low  to  moderate  reliability 
as  reflected  in  the  weighted  average  of  rxx  =  .670. 
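
The single-test estimate and the weighted average can be approximated with a short Python sketch. It assumes the usual Spearman-Brown prophecy formula and weights each intraclass correlation in Table B4 by its number of samples; the report does not state the weighting scheme, so the sample-count weights are an assumption, and the function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Spearman-Brown prophecy formula.

    length_factor is the ratio of new test length to original length;
    0.5 steps a two-test composite down to a single test.
    """
    return (length_factor * reliability
            / (1.0 + (length_factor - 1.0) * reliability))


# ICC for samples with two tests (Table B4), stepped down to one test.
icc_single = spearman_brown(0.777, 0.5)

# (number of samples, ICC) for samples with one to six tests (Table B4).
rows = [(117, icc_single), (116, 0.777), (24, 0.385),
        (15, 0.731), (9, 0.632), (9, 0.461)]

total_samples = sum(n for n, _ in rows)
weighted_icc = sum(n * icc for n, icc in rows) / total_samples

print(round(icc_single, 3))    # 0.635
print(total_samples)           # 290
print(round(weighted_icc, 2))  # 0.67
```

Under these assumptions the weighted average is close to the reported value of .670.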

Impact  of  test  selection.  The  analyses  reported  in  this  paper  were  restricted  to  the 
average ESRM for whichever of the 10 strength tests represented by ≥20 ESRM estimates
had  been  administered  to  each  sample.  However,  an  average  could  only  be  computed  for 
samples  from  studies  that  included  at  least  1  of  the  10  frequently  used  tests.  The  decision 
to  use  the  average  eliminated  four  of  the  302  samples  in  this  review.  Restricting  the 
computation  of  average  ESrm  to  those  10  tests  eliminated  10.0%  (71  of  713)  of  the 
ESRM values and 9.6% (1,001 of 10,407) of the individual test scores. These losses were
balanced by the fact that the mean ES for every remaining test was based on a reasonably
broad  ESrm  sample  and  a  substantial  minimum  cumulative  sample  size  of  252  test  scores. 

Summary  observations.  Three  comments  are  in  order  regarding  the  decision  to 
use  the  average  ESrm  to  compute  variance  estimates  for  the  analyses  reported  in  the  main 
body  of  this  paper.  First,  the  initial  observations  that  ESrm  varied  as  a  function  of 
strength  test  could  be  explained  by  confounding  of  test  type  with  demographic 
characteristics.  If  demographics  were  the  basis  for  the  original  confounding,  ESrm 
estimates  that  are  independent  of  demographics  are  the  appropriate  values  to  include  in 
the  analyses.  Second,  averaging  effects  within  studies  eliminates  an  important  statistical 
problem.  When  multiple  ES  estimates  are  derived  from  a  single  sample,  those  estimates 
are not independent. Analyses should allow for the lack of independence (Gleser &
Olkin, 1994). Using an average ES is a common method of dealing with this problem.
Ideally,  there  should  be  some  empirical  justification  for  averaging.  Table  B3  provided 
this  justification  for  the  present  purposes.  The  average  ESRM  was  not  significantly 
different  controlling  for  demographic  differences.  Third,  it  can  be  argued  that  all  of  the 
strength  tests  are  indicators  of  a  single  general  strength  construct  (Vickers,  2003).  If  all  of 
the  tests  measure  the  same  construct,  the  average  effect  is  the  best  available  estimate  of 
the  program  impact  on  the  basic  underlying  strength  construct.  Thus,  averaging  the  ESRM 
values  for  different  strength  tests  administered  to  a  particular  sample  provides  an 
empirically,  statistically,  and  conceptually  defensible  index  of  resistance  training  effects. 

Analysis  Weights 

The  decision  to  average  ESrm  values  affected  the  computation  of  the  weights  for 
the  moderator  analyses.  Three  different  sets  of  weights  were  used  at  various  points  in  the 
analyses.  The  first  weight  set  was  test-specific.  Weights  were  computed  using  the 
unweighted test-specific average ESRM as dRM(Test). For example, the average ESRM for
bench  press  was  used  to  compute  bench  press  weights,  the  average  ESRM  for  biceps  curls 
was  used  to  compute  biceps  curl  weights,  and  so  forth.  This  weight  set  was  employed  in 
the  initial  bivariate  assessments  of  moderator  effects. 

The  second  weight  set  was  computed  after  the  analyses  showed  that  test  results 
could  be  reduced  to  a  single  ESRM  estimate  by  averaging  the  values  for  the  10  most 
frequently  used  tests.  Average  values  were  computed  for  each  sample  in  the  analysis.  The 
average was based on one to eight tests. The unweighted mean value for the average
(ESrm, δRM = 1.81) was used in these computations.

The  third  weight  set  allowed  for  the  effect  of  averaging  within  samples.  An 
average  score  should  be  a  more  accurate  estimate  of  the  training  effect  than  an  individual 
score.  Averaging  ESRM  was  analogous  to  the  effect  of  averaging  several  items  in  a 
questionnaire  to  obtain  an  individual’s  scale  score.  If  the  average  is  derived  from 
independent  observations,  the  variance  of  the  mean  is  the  sum  of  the  individual  variances 
divided by the square of the number of observations. The third weight set was computed
by dividing the variance associated with the second weight set by k², where k was the
number of effects that were being averaged for the sample. Note that this correction is
conservative  because  it  does  not  allow  for  the  correlation  of  scores  from  different  tests. 
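
The rule for the variance of a mean of independent estimates, and the inverse-variance weight derived from it, can be sketched as follows. This is a minimal illustration rather than the report's actual computation: the function names and the numeric values are hypothetical, and the sketch simply applies the general rule stated above (sum of the individual variances divided by the square of the number of observations).

```python
from typing import Sequence


def variance_of_mean(variances: Sequence[float]) -> float:
    """Variance of the mean of independent estimates: sum(v_i) / k**2."""
    k = len(variances)
    if k == 0:
        raise ValueError("at least one variance is required")
    return sum(variances) / (k ** 2)


def inverse_variance_weight(variance: float) -> float:
    """Analysis weight: the reciprocal of the sampling variance."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return 1.0 / variance


# Hypothetical sample: four strength tests, each ES with sampling variance 0.08.
test_variances = [0.08, 0.08, 0.08, 0.08]
avg_variance = variance_of_mean(test_variances)     # 0.32 / 16
avg_weight = inverse_variance_weight(avg_variance)  # larger weight than any single test
```

With equal variances the averaged effect has one k-th of the variance of a single test, so a sample that contributed more tests receives a proportionally larger weight.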

The  inverse  of  the  variances  served  as  the  weight  variable  for  the  moderator 
analyses.  The  third  weight  set  was  used  in  the  analyses  described  in  the  main  body  of  this 
report. All of the analyses had the average ESrm or the adjusted average ESrm as the
dependent  variable.  The  choice  of  which  weight  to  use  was  of  more  importance 
conceptually  than  in  practice.  The  three  weight  sets  were  highly  correlated  (r  >  .964). 
Appendix C
Tucker-Lewis  Index 


The TLI (Tucker & Lewis, 1973) was introduced to guard against what may be a
widespread problem in meta-analysis. Moderator analyses begin with a significance test.
If  the  null  hypothesis  is  rejected,  the  moderator  variable  is  accepted  as  a  meaningful 
influence  on  ES  even  if  the  differences  between  groups  are  quite  small. 

Relying  on  significance  tests  to  identify  important  results  is  a  risky  proposition  in 
any  statistical  analysis.  Statistical  significance  is  the  product  of  sample  size  and  ES 
(Rosenthal  &  Rosnow,  1984).  In  meta-regression,  ES  might  be  labeled  “meta-ES” 
because  it  reflects  the  differences  in  the  primary  ES  across  moderator  groups.  A 
significant  meta-ES  could  indicate  a  substantial  between-groups  difference,  but  it  does 
not rule out the possibility that a small between-groups difference has been amplified by a
large  sample  size.  Although  it  follows  that  the  meta-ES  must  be  separated  from  sample 
size  to  properly  interpret  findings,  this  principle  is  not  routinely  applied  to  meta-analysis 
even  though  logic  says  it  should  be. 

The TLI was adapted to provide a meta-ES metric. The TLI, which is the
proportion of greater than chance variation in ES, can be computed from the χ² values
from a moderator analysis. The variation in ESrm determines the χ² values. The TLI
equation is:

$$
\mathrm{TLI} = \frac{\left(\chi^2_{\mathrm{Null}} / df_{\mathrm{Null}}\right) - \left(\chi^2_{\mathrm{Model}} / df_{\mathrm{Model}}\right)}{\left(\chi^2_{\mathrm{Null}} / df_{\mathrm{Null}}\right) - 1} \tag{C1}
$$

The expected value of the χ²/df ratio is 1, so the denominator of Equation C1 is the
proportion of the observed variation in ESrm that is greater than expected by chance. The
numerator is the variation in ESrm accounted for by the model (i.e., total ESrm variation
minus the residual ESrm variation after fitting the model).

The  TLI  is  a  reasonable  index  of  the  meta-regression  ES.  This  effect  size  index 
makes  use  of  the  noncentrality  parameter  (McDonald  &  Marsh,  1990)  that  is  the  basis  for 
the  Q  statistic  that  provides  the  test  for  homogeneity  of  variance  (Hedges,  1982).  Thus,  a 
meta-regression  effect  size  based  on  TLI  maintains  a  connection  between  effect  size  and 
the  probability  that  a  moderator  will  be  statistically  significant. 
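
For readers who want to see where a χ² value of this kind comes from, the following is a minimal sketch of the usual fixed-effect homogeneity Q statistic (a weighted sum of squared deviations from the weighted mean ES). The function name and the example values are hypothetical, and the sketch is not a reproduction of the software used in this review.

```python
from typing import Sequence, Tuple


def homogeneity_q(effects: Sequence[float],
                  variances: Sequence[float]) -> Tuple[float, int]:
    """Return (Q, df) where Q = sum(w_i * (ES_i - weighted mean)**2), w_i = 1/v_i."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need two or more effects with matching variances")
    weights = [1.0 / v for v in variances]
    weighted_mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - weighted_mean) ** 2 for w, e in zip(weights, effects))
    return q, len(effects) - 1


# Hypothetical effect sizes with equal sampling variances.
q_value, q_df = homogeneity_q([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
ratio = q_value / q_df  # values well above 1 suggest more than chance variation
```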

The TLI is not an exact parallel to the usual effect size indicators, such as the
proportion of variance explained in an ANOVA. One reason is that TLI is analogous to
Hays’s (1963) ω² rather than the usual ε². The difference between the two is that the
variance that would be expected by chance is subtracted from the variance explained
when computing ω² but not when computing ε². This difference is the reason that TLI
will be less than zero when $\chi^2_{\mathrm{Model}} / df_{\mathrm{Model}} > \chi^2_{\mathrm{Null}} / df_{\mathrm{Null}}$ because the numerator will be a
negative number. This situation arises when the reduction in the χ² produced by a model
is small relative to the number of parameters in the model. For this reason, the reported
TLI is the value derived from Equation C1 or .00, whichever is larger.
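
The TLI computation, including the floor at .00, can be sketched in a few lines. The function name, argument names, and numeric inputs below are hypothetical; the handling of a null ratio at or below 1 (returning 0) is an assumption made for the sketch, not something stated in the text.

```python
def tucker_lewis_index(chi2_null: float, df_null: int,
                       chi2_model: float, df_model: int) -> float:
    """TLI from null and model chi-square/df ratios, floored at 0."""
    if df_null <= 0 or df_model <= 0:
        raise ValueError("degrees of freedom must be positive")
    null_ratio = chi2_null / df_null
    model_ratio = chi2_model / df_model
    excess = null_ratio - 1.0  # variation beyond the chance expectation of 1
    if excess <= 0:
        # Assumption: with no greater-than-chance variation, report 0.
        return 0.0
    return max((null_ratio - model_ratio) / excess, 0.0)


# Hypothetical moderator analysis: the model removes a good share of the excess.
tli_useful = tucker_lewis_index(chi2_null=120.0, df_null=40,
                                chi2_model=60.0, df_model=38)

# Hypothetical weak model: small chi-square reduction, several parameters used.
tli_weak = tucker_lewis_index(chi2_null=120.0, df_null=40,
                              chi2_model=118.0, df_model=36)
```

In the second call the model ratio exceeds the null ratio, so the raw value would be negative and the floor returns .00.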

The  interpretation  of  the  TLI  employed  Cohen’s  (1988)  general  criteria  for  ES 
evaluations.  Cohen’s  criteria  classify  ESs  on  the  basis  of  the  proportion  of  observed 
variation explained by a predictor. In this case, TLI is the proportion of non-random
variation in ESRM, so Cohen’s (1988) ES classification rule is a suitable index for
characterizing the strength of association of moderator variables with ESrm (small
meta-ES, .01 < TLI < .10; moderate meta-ES, .10 < TLI < .25; large meta-ES, TLI > .25).
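
A minimal sketch of this classification rule follows. The function name is hypothetical, the "negligible" label for values below .01 is an added convenience, and treating each lower bound as inclusive is an assumption because the printed inequalities do not say which category a boundary value belongs to.

```python
def classify_meta_es(tli: float) -> str:
    """Label a TLI value with the Cohen-style categories described in the text."""
    if tli >= 0.25:
        return "large"
    if tli >= 0.10:
        return "moderate"
    if tli >= 0.01:
        return "small"
    return "negligible"  # hypothetical label for values below the small criterion


labels = [classify_meta_es(v) for v in (0.0, 0.05, 0.15, 0.30)]
```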
Appendix D

Analyses  With  Individual  Effect  Sizes  as  Independent  Observations 

An  initial  series  of  moderator  analyses  was  undertaken  with  each  ESrm  treated  as  an 
independent  observation.  This  approach  was  adopted  because  it  has  been  routinely  used 
in  other  resistance  training  meta-analyses  of  simple  ES.  Its  inclusion  provided  a  reference 
point  for  assessing  the  impact  of  shifting  to  a  repeated  measures  effect  size.  The  weights 
used  in  this  analysis  were  based  on  test-specific  values  for  the  average  ESrm. 

Demographic  characteristics  generally  had  a  stronger  impact  on  individual  ESrm 
values than did program characteristics (see Table D1). The much larger χ² values in this
analysis compared with the analysis of weighted average effects (see Table 3) were the
most notable aspect of the findings. The impact of training status was particularly
pronounced. The strength of the training status effect is illustrated by the fact that the χ²
for this moderator was not much less than the sum of the χ²s for all other moderators.
Allowing  for  the  degrees  of  freedom  in  each  analysis,  the  choice  between  1  set  per 
session  and  3  sets  per  session  was  the  most  important  program  determinant  of  the 
response. 

Table D1

Based on Individual ESrm Values

                              Moderator            Residual
                             χ²       df          χ²        df
Demographics
  Gender                   134.92      1       4096.18     532
  Age group                111.42      1       5589.81     727
  Training status          714.41      1       5375.78     692
Program elements
  Periodization              1.19      1       5491.45     712
  Sessions per week        109.80      4       5514.89     710
  Sets per session         199.89      5       3980.22     507
  Repetitions per set      317.74      6       3385.66     432


A  second  analysis  of  the  individual  ESs  was  carried  out  with  no  weights.  The 
analyses  were  simple  ANOVA  procedures  with  a  moderator  variable  defining  the  groups 
in  the  analysis.  The  simple  ESs  in  this  analysis  were  computed  by  subtracting  the  pretest 
mean  from  the  posttest  mean  and  dividing  by  the  pretest  standard  deviation.  This  analysis 
procedure appeared to replicate that which was used in meta-analyses by Rhea and his
colleagues (Peterson et al., 2005; Rhea & Alderman, 2004; Rhea et al., 2002; Rhea et al.,
2003).
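
The simple ES described above (posttest mean minus pretest mean, divided by the pretest standard deviation) can be sketched as follows. The function name and the strength values are hypothetical and serve only to show the arithmetic.

```python
def simple_effect_size(pre_mean: float, post_mean: float, pre_sd: float) -> float:
    """Simple ES: (posttest mean - pretest mean) / pretest standard deviation."""
    if pre_sd <= 0:
        raise ValueError("pretest standard deviation must be positive")
    return (post_mean - pre_mean) / pre_sd


# Hypothetical bench press means in kg before and after a program.
es = simple_effect_size(pre_mean=100.0, post_mean=115.0, pre_sd=12.0)
```

Because the pretest/posttest correlation does not enter this formula, the result is not a repeated measures effect size.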
Table D2

Analysis of Unweighted Simple ES

Moderator variable      No. ES    Avg ES    F test    Sig.     ε²
Gender
  Men                     362      1.36
  Women                   162      1.93      36.69    .000    .037
Age
  Younger                 497      1.16
  Older                   222      1.38      84.29    .000    .072
Training status
  Untrained               512      1.67
  Trained                 175       .77     105.15    .000    .090
Periodization
  Progressive             505      1.53
  Periodized              199      1.06      31.39    .000    .028
Sessions per week
  1                        19       .58
  2                       211      1.05
  3                       448      1.66
  4                        20       .69
  5                         4      1.17      78.31    .000    .067
Sets per session
  1                        87       .90
  2                        29       .99
  3                       330      1.79
  4                        21      1.17
  5                        22      1.09
  6                        10      1.16      74.21    .000    .087
Repetitions per set
  <60%                     15      2.27
  60%                      16      1.63
  65%                       9      1.77
  70%                      17      1.21
  75%                     177      1.11
  80%                     127      1.95
  85%                      63      1.62      64.45    .000    .089

Note. The average effect size was smaller because the computations did not allow for the
pretest/posttest correlation. Most of the effects were highly significant (p < .001), but the
average ES were uniformly in the small ES range.



The  results  of  the  unweighted  analysis  of  individual  ESs  were  broadly  comparable  to  the 
results  of  the  other  analyses  conducted  in  this  study  (see  Table  D2).  The  analyses 
indicated significantly larger ESs for women, older people, and untrained individuals. The
finding that non-periodized programs produced larger gains than periodized programs
conflicted with Rhea and Alderman’s (2004) findings. ES did not display any simple
pattern of associations to sessions per week, sets per session, or repetitions per set.

The  analysis  of  unweighted  simple  ESs  was  extended  to  illustrate  that  the 
problems  introduced  by  confounding  of  different  moderator  variables  could  yield 
misleading  results  in  this  type  of  analysis  as  well  as  in  the  analyses  reported  in  the  main 
body  of  this  paper.  A  two-way  analysis  of  variance  was  performed  with  training  status 
and  periodization  as  predictors  of  ES.  Experience  was  strongly  related  to  ES  as  it  had 
been in the bivariate analyses (F = 60.89, p < .001, ε² = .083). The difference between
untrained and trained weightlifters was largely unchanged (untrained, ES = 1.63; trained
weightlifters, ES = .70). However, periodization was no longer a significant predictor of
ES (F = .80, p = .798, ε² = .001). The difference between progressive and periodized
programs was reduced from d = 0.47 to d = 0.10 (progressive, ES = 1.22; periodized, ES
= 1.12). Based on this analysis, the confounding of periodization with experience
explained  the  bivariate  association  of  periodization  with  ES. 
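
The way confounding can create a bivariate difference that shrinks once training status is controlled can be illustrated with a small sketch. The cell counts and means below are hypothetical, not values from this review, and averaging within-stratum differences is a simpler stand-in for the two-way analysis of variance described above.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (training status, program type)

# Hypothetical (count, mean ES) per cell: periodized programs are concentrated
# among trained participants, who respond less to training.
cells: Dict[Cell, Tuple[int, float]] = {
    ("untrained", "progressive"): (400, 1.65),
    ("untrained", "periodized"): (100, 1.60),
    ("trained", "progressive"): (75, 0.72),
    ("trained", "periodized"): (100, 0.68),
}


def marginal_mean(data: Dict[Cell, Tuple[int, float]], program: str) -> float:
    """Mean ES for a program type, ignoring training status."""
    rows = [(n, m) for (_, prog), (n, m) in data.items() if prog == program]
    return sum(n * m for n, m in rows) / sum(n for n, _ in rows)


def stratified_difference(data: Dict[Cell, Tuple[int, float]]) -> float:
    """Average progressive-minus-periodized difference within training status."""
    statuses = sorted({status for status, _ in data})
    diffs = [data[(s, "progressive")][1] - data[(s, "periodized")][1]
             for s in statuses]
    return sum(diffs) / len(diffs)


unadjusted = marginal_mean(cells, "progressive") - marginal_mean(cells, "periodized")
adjusted = stratified_difference(cells)
```

In this constructed example the unadjusted difference is several times larger than the adjusted one, even though the program types differ very little within either training status group.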
Appendix E

Comparison  With  Findings  of  Other  Meta-Analytic  Reviews 

Research generates a body of evidence. By summarizing that evidence, meta-analyses
indicate which findings should be taken as established facts. Discussions that
establish  the  proper  interpretation  of  those  facts  convert  the  evidence  to  reliable  scientific 
knowledge  (Ziman,  1978). 

No  single  meta-analysis  is  likely  to  establish  a  definitive  set  of  facts  within  a 
given  research  domain.  Uncertainty  is  unavoidable  because  meta-analyses  share  some 
attributes  with  primary  studies.  The  studies  in  a  meta-analysis  are  only  a  sample  from  a 
universe  of  potential  studies.  The  choice  of  characteristics  to  code,  coding  criteria, 
method  of  computing  ES,  and  choice  of  analysis  procedures  are  analogous  to  research 
measurements  and  analysis  decisions  that  are  made  in  primary  studies.  Considered  in  the 
abstract,  meta-analyses  have  much  in  common  with  survey  studies.  As  in  survey  studies, 
different  reviews  can  produce  different  results  because  each  review  involves  decisions  at 
multiple  choice  points  (Wanous,  Malinek,  &  Sullivan,  1989)  and  choices  will  differ  from 
one  review  to  the  next. 

The  parallels  between  the  design  and  analysis  of  surveys  and  meta-analyses  mean 
that  developing  a  feeling  for  the  robustness  of  findings  from  meta-analyses  is  important. 
This  undertaking  is  loosely  comparable  to  replicating  survey  findings.  The  study  samples 
can  be  expected  to  overlap  from  one  meta-analysis  of  a  topic  to  the  next,  but  the  samples 
seldom  are  identical.  The  introduction  of  adjustments  for  program  duration  and  repeated 
measures  are  examples  of  choices  that  can  be  seen  as  yielding  a  survey  that  involves 
replication  with  extensions  of  the  earlier  work.  To  develop  the  required  feeling  for  how 
these  modifications  affected  the  meta-analytic  results,  the  findings  from  this  review  were 
compared  with  the  findings  from  reviews  by  Peterson  et  al.  (2004),  Rhea  and  Alderman 
(2004), Rhea et al. (2002), Rhea et al. (2003), and Wolfe et al. (2004). The present
findings were not compared to the results of two prior reviews of the effectiveness of
resistance training in children (Falk & Tenenbaum, 1996; Payne et al., 1997). The present
findings could not be compared to those of Peterson, Rhea, and Alvar (2005) because
the findings of that review were only reported graphically.

Comparison  Procedures 

The  comparisons  with  other  reviews  focused  on  the  assessment  of  moderator 
effects.  Three  questions  were  posed  for  each  moderator:  Is  the  moderator  effect 
statistically  significant?  Is  the  effect  large  enough  to  be  important?  Does  the  evidence 
identify  a  single  best  practice  training  option?  The  last  question  was  equivalent  to  asking 
“Does  the  equivalence  set  contain  a  single  option?”  Affirmative  answers  to  all  three 
questions  would  identify  a  robust  best  practice. 

Each  comparison  involved  three  analyses.  First,  the  means  and  standard 
deviations  reported  in  prior  meta-analyses  were  reanalyzed.  These  reanalyses  were 
needed to estimate the amount of variance explained by moderators and to perform post
hoc  comparisons  to  define  equivalence  sets.  Second,  the  methods  used  in  the  comparison 
review were applied to the studies covered in this review. The unweighted ES was the
dependent variable in an ANOVA that treated each test administered to a sample as an
independent  observation.  Third,  the  ANOVAs  were  repeated  with  the  individual  ESrm 
values  replacing  ES  with  weights  based  on  test-specific  average  effects  (see  Appendix 
B).  The  senior  author’s  last  name  has  been  used  to  label  reanalyses  of  results  from  prior 
reviews.  The  analyses  of  the  simple  ES  and  ESrm  from  this  review  have  been  labeled 
ANOVA  and  GLM  (i.e.,  general  linear  model)  to  indicate  the  analysis  methods. 

Equivalence  sets  were  defined  in  three  steps.  First,  the  statistical  significance  of 
the moderator effect was determined by ANOVA or GLM. Second, the variance
explained by the moderator variable was computed. The effect size computations
produced η² values for each moderator. This statistic was used for all moderator analyses.
Third, post hoc comparisons were performed.

The  post  hoc  comparisons  began  by  ranking  the  cells  defined  by  the  moderator 
from  largest  average  ES  to  smallest  average  ES.  The  difference  between  the  highest 
average  and  the  second  highest  average  was  tested  for  statistical  significance.  If  the 
difference  was  statistically  nonsignificant,  the  difference  between  the  highest  average  and 
the  third  highest  average  was  tested  for  significance.  The  sequence  of  tests  continued 
until  a  significant  difference  was  identified.  Fisher’s  least  significant  difference  was  the 
significance  criterion  for  the  analyses  of  the  simple  ES  outcome  measures.  A  Bonferroni 
adjustment  was  used  in  the  ESrm  analyses.  The  Bonferroni  adjustment  assumed  that  the 
moderator  level  with  the  highest  average  would  be  compared  to  each  other  level  of  the 
moderator.  The  equivalence  sets  consisted  of  the  moderator  level  with  the  highest 
average  ES  plus  the  other  levels  that  were  not  significantly  different  from  this  reference 
point. 
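
The ranking and sequential testing procedure can be sketched as follows. This is an illustration under stated assumptions: the comparison uses a large-sample z test on summary statistics rather than Fisher's least significant difference, the function and variable names are hypothetical, and the cell values are invented for the example.

```python
import math
from typing import Dict, List, Tuple

Summary = Tuple[float, float, int]  # (mean ES, standard deviation, number of ESs)


def two_sided_p(z: float) -> float:
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def equivalence_set(cells: Dict[str, Summary], alpha: float = 0.05,
                    bonferroni: bool = False) -> List[str]:
    """Top-ranked level plus lower levels tested in order until one differs."""
    ranked = sorted(cells, key=lambda level: cells[level][0], reverse=True)
    best_mean, best_sd, best_n = cells[ranked[0]]
    comparisons = len(ranked) - 1
    # Bonferroni: assume the top level is compared with every other level.
    threshold = alpha / comparisons if bonferroni and comparisons else alpha
    members = [ranked[0]]
    for level in ranked[1:]:
        mean, sd, n = cells[level]
        std_error = math.sqrt(best_sd ** 2 / best_n + sd ** 2 / n)
        if two_sided_p((best_mean - mean) / std_error) < threshold:
            break  # first significant difference ends the sequence
        members.append(level)
    return members


# Hypothetical moderator with three levels.
example_cells = {"A": (1.80, 1.2, 300), "B": (1.60, 1.1, 20), "C": (0.90, 1.0, 90)}
fisher_like = equivalence_set(example_cells)
adjusted_set = equivalence_set(example_cells, bonferroni=True)
```

Level B stays in the set because its small number of ESs makes the difference from A nonsignificant; level C is clearly lower and ends the sequence.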

Each  data  analysis  from  this  review  was  performed  after  imposing  selection 
criteria  that  matched  the  data  as  closely  as  possible  to  the  reference  review.  For  example, 
Peterson  et  al.  (2004)  limited  their  review  to  athletes.  The  corresponding  analyses  of  the 
present  data  were  limited  to  data  from  studies  of  competitive  weightlifters  or  other  types 
of  athletes  who  routinely  engage  in  resistance  training  (e.g.,  football  players). 

Rhea  et  al.  (2003)  Comparison  -  Dose  Response  for  Strength  Development 

Rhea  et  al.  (2003)  provided  the  broadest  picture  of  the  evidence  for  any  of  the 
comparison  reviews.  Their  review  covered  140  studies  that  produced  1433  ESs. 
Preliminary  analyses  showed  a  significant  difference  in  training  response  between  trained 
and  untrained  individuals  and  no  difference  for  men  and  women.  Based  on  those 
findings,  other  potential  moderators  were  examined  separately  for  trained  and  untrained 
individuals with samples of men and women within each category (see Table E1). The
corresponding  analyses  of  the  present  data  included  a  similar  division  into  trained  and 
untrained  individuals  with  separate  analyses  for  each  group. 
Table E1

Rhea et al. (2003) Comparison

Moderator      Analysis    Significance     ω²       Equivalence set
Trained
  Intensity    Rheaᵃ           .000        .168*     {80%}
               ANOVAᵇ          .001        .327*     {80%, 60%}
               GLMᶜ            .011        .169*     {60%, 85%, 80%}
  Sets         Rhea            .187                  {4, 5, 3, 2, 1}
               ANOVA           .399                  {4, 3, 5, 1}
               GLM             .000        .162*     {4}
  Sessions     Rhea            .000        .100*     {2}ᵈ
               ANOVA           .004        .078*     {3}
               GLM             .731        .002      {3, 2, 4, 1}
Untrained
  Intensity    Rhea            .000        .033      {60%, 75%, 40%, 80%, 50%}
               ANOVA           .000        .073      {<60%, 60%, 80%, 85%, 65%}
               GLM             .000        .093      {85%, <60%, 60%, 65%}
  Sets         Rhea            .000        .035      {4, 3, 2}
               ANOVA           .000        .086*     {3, 5}
               GLM             .000        .074*     {3, 5}
  Sessions     Rhea            .000        .014      {3}
               ANOVA           .013        .021      {3}
               GLM             .000        .019      {2}

ᵃ,ᵇ[Footnote text illegible in source.] ᶜ[...] aggregated ES from the present review. ᵈDichotomous comparison.
*Moderator effect exceeds Cohen’s (1988) criterion for a small effect.
Table E2

Wolfe et al. (2004) Comparison

                                            ω²
Moderator          Analysis     Sets      Moderator    S x M interaction
Age                Wolfe        .000      .206**       .031
                   ANOVA        .038      .004         .016**
                   GLM          .033      .000         .001
Length             Wolfe        .008      .021         .082**
                   ANOVA        .032**    .008*        .000
                   GLM          .035**    .005**       .003
Men vs. women      Wolfe        .004      .010#        .011
                   ANOVA        .000      .027*        .001
                   GLM          .006      .002         .000
Training status    Wolfe        .002      .004
                   ANOVA        .011*     .035**       .005
                   GLM          .012      .052**       .001

Note. # p = .099; Cohen’s (1988) criteria would classify every significant difference as representing a
small, but potentially important effect.
*p < .05. **p < .01.



Statistical  significance.  Intensity  was  the  only  moderator  that  was  consistently 
significant  for  trained  individuals.  All  three  moderators  were  consistently  significant  for 
untrained  individuals. 

Variance  explained.  The  intensity  effect  for  trained  individuals  was  the  only 
moderator  that  consistently  exceeded  Cohen’s  (1988)  minimum  criterion  for  a  small 
effect. 

Equivalence  sets.  There  was  no  equivalence  set  that  was  consistent  across  the 
three  analyses. 
Wolfe et al. (2004) Comparison - Single Versus Multiple Sets

Wolfe et al. (2004) reported the results of a review that covered 16 studies with 103 ESs
(see Table E2). Means and standard deviations were reported for two-way classifications.
Single set versus multiple sets was a factor in each classification. The other factors were
gender, age, training status, and program length. Wolfe et al. (2004) reported the mean
and standard deviation for each cell defined by each two-way classification. The cells in
each cross-classification were analyzed as a one-way ANOVA. The reanalysis
reconfigured the cells within each classification as a two-way ANOVA. Wolfe et al.
(2004) did not indicate any selection criteria, so the reanalyses were carried out using all
of the ESs in the present data.

Table E3

Rhea et al. (2002) Comparison: Single Versus Multiple Sets Review

Moderator            Analysis    Significance     ω²       Equivalence set
Training status      Rhea            .061        .062*     {Trained, untrained}
                     ANOVA           .000        .090*     {Untrained}
                     GLM             .000        .133*     {Untrained}
Length (in weeks)    Rhea            .287        .027      {11-15, 21-25, 6-10}
                     ANOVA           .000        .045      {21-25}
                     GLM             .000        .035      {21-25}

*Exceeds Cohen’s (1988) criterion for classification as a small effect.

Table E2 reports η² values for the main effects and the interaction effect. The η²
values are based on the variance explained after controlling for the other effects in the
table (i.e., unique sums of squares). Because each main effect and interaction involved a
single degree of freedom, any η² > .010 met Cohen’s (1988) minimum criterion for a
potentially important influence on the training response.

Training status was the only consistent effect. Although not shown in the table,
the average values indicated that the training effect was much stronger for untrained than
trained individuals.

Rhea  et  al.  (2002)  -  Multiple  Sets 

Rhea  et  al.  (2002)  provided  a  different  approach  to  comparing  single  and  multiple 
set  programs  (see  Table  E3).  A  set  of  16  relevant  studies  produced  93  ESs.  The  ES 
computations  treated  multiple  sets  as  the  experimental  group  and  single  sets  as  the 
control  group.  The  ES  was  based  on  the  difference  between  the  changes  observed  in  the 
two  conditions.  This  review  was  the  only  one  that  based  ES  on  the  difference  in  the 
improvements produced by two training programs. The current data set included some
studies  for  which  a  single  set  program  could  have  been  matched  to  a  multiple  set  program 
in  the  same  study.  However,  this  matching  was  not  undertaken.  Instead,  the  analyses  of 
the  present  data  approximated  the  Rhea  et  al.  (2002)  analysis  by  using  two-way 
ANOVAs  to  estimate  the  effects  of  training  status  and  program  length  controlling  for 
single  versus  multiple  sets. 

Statistical  significance.  Training  status  was  a  consistent  finding  even  though  it 
was only marginally significant in the present reanalysis of Rhea et al.’s (2002) data (see
Table E3). The effect of training status just reached statistical significance in the original
analyses, F = 4.03, p < .0497. The difference may be the result of rounding the means
and  standard  deviations  to  two  decimal  places  when  the  initial  findings  were  reported. 

Variance  explained.  Training  status  met  Cohen’s  (1988)  minimum  criterion  for  a 
small  ES. 

Equivalence  sets.  Untrained  individuals  consistently  displayed  greater  training 
effects  than  trained  individuals. 

Rhea  and  Alderman  (2004)  Comparison 

Rhea and Alderman (2004) reviewed the effects of periodized resistance training
programs.  Their  review  covered  data  from  105  studies  that  produced  at  least  649  ESs  for 
the  moderator  analyses. 

Rhea and Alderman (2004) performed a test of the overall difference between
periodized and non-periodized programs as the first step in their analysis (see Table E4).
When that comparison indicated statistically significant differences, they chose to include
only the effects from periodized programs in subsequent analyses of moderator
variables.

The  difference  between  periodized  and  non-periodized  programs  was  statistically 
significant and large enough to be important in Rhea and Alderman’s (2004) analysis and
when  the  same  methods  were  applied  to  the  present  data.  However,  the  periodization 
effect  was  not  even  statistically  significant  in  the  GLM  analysis.  This  inconsistency 
suggested  that  the  choice  of  analysis  procedures  was  an  important  factor  in  these 
analyses.  Subsequent  moderator  tests  were  limited  to  just  those  ESs  from  groups  that 
completed  periodized  programs. 

Statistical  significance.  Gender,  age,  training  status,  and  program  length  were 
consistently  significant  moderators  of  ES. 

Variance  explained.  Gender  and  training  status  consistently  met  the  Cohen 
criterion. 

Equivalence sets. Untrained individuals consistently displayed stronger training
effects.
Table E4

Rhea and Alderman (2004) Comparison: Meta-Analyses for Periodization

Review               Analysis    Significance     ω²       Equivalence set
Overall              Rhea            .000        .013*     {Periodized, non-periodized}ᵃ
                     ANOVA           .000        .028*     {Non-periodized}
                     GLM             .275        .000      {Non-periodized, periodized}
Within periodized programs
  Gender             Rhea            .000        .132*     {Combined}ᵇ
                     ANOVA           .000        .095*     {Women}
                     GLM             .000        .136*     {Women}
  Age (in years)     Rhea            .000        .020      {<55}ᶜ
                     ANOVA           .050        .019      {>55}
                     GLM             .029        .003      {>55, <55}
  Training status    Rhea            .000        .110*     {Untrained}
                     ANOVA           .000        .115*     {Untrained}
                     GLM             .000        .336*     {Untrained}
  Program length     Rhea            .019        .012      {9-20, 1-8, 20-40 weeks}
                     ANOVA           .000        .142*     {21-25 weeks}
                     GLM             .000        .055      {21-25, 26-40 weeks}

ᵃPeriodized would have been the only choice if the group labeled “Overall” had been omitted.
ᵇMen and women combined into a single sample. Men and women would have been assigned
to a single group if the analysis had been limited to those two groups. ᶜModerator variable
was a dichotomy.
Peterson et al. (2004) - Dose Response Relationship for Athletes

Peterson  et  al.  (2004)  examined  the  dose  response  relationship  for  athletes  only. 
The  review  covered  37  studies  with  370  ESs  (see  Table  E5).  The  closest  comparison 
group  in  this  analysis  was  the  trained  weightlifters. 

Table E5

Comparison of Dose Response Analyses for Athletes

Moderator    Analysis    Significance     ω²       Equivalence set
Intensity    Peterson        .000        .143*     {85%, 75%, 80%}
             ANOVA           .001        .327*     {80%, 60%}
             GLM             .000        .184*     {<60%, 85%, 75%}
Sets         Peterson        .027        .053      {8, 14, 4, 12, 6, 5, 16, 3, 1}
             ANOVA           .399        .044      {4, 3, 5, 1}
             GLM             .042        .062*     {1, 3, 4, 5}
Sessions     Peterson        .926        .000      {2, 3}
             ANOVA           .004        .078*     {3}
             GLM             .000        .099      {3, 2}

Statistical  significance.  Intensity  was  the  only  consistently  significant  moderator. 

Variance  explained.  Intensity  consistently  met  Cohen’s  (1988)  minimum  ES 
criterion. 

Equivalence  sets.  None  of  the  equivalence  sets  were  identical.  In  fact,  there  was 
not  a  single  intensity  that  was  included  in  all  3  sets. 

Comparison  Summary 

The  three  questions  that  guided  the  comparisons  provided  a  sequential  screening 
process  to  identify  best  practices.  A  total  of  24  comparisons  were  made  if  the  interaction 
terms for the Wolfe et al. (2004) comparison are excluded from consideration. Ten of 24
moderator  effects  were  significant  in  all  three  analyses.  Only  five  of  the  10  consistently 
significant  moderator  effects  also  explained  enough  variance  to  meet  Cohen’s  (1988) 
minimum  criterion  for  a  small  effect  size.  In  three  of  the  10  cases,  the  variance  explained 
was  consistently  less  than  Cohen’s  (1988)  minimum  criterion.  Finally,  only  two  of  the 
five  cases  that  met  the  first  two  criteria  consistently  produced  the  same  equivalence  set. 
In  both  cases,  the  consistent  equivalence  set  consisted  solely  of  untrained  individuals. 
Given  that  training  status  is  not  ordinarily  thought  of  as  an  integral  part  of  the  program 
design, there was not a single instance of a best practice as the term has been defined in
this report. It apparently has been easy to obtain a statistically significant moderator
effect, but hard to move from there to the identification of a single best design option
when variance explained and statistical significance criteria are applied before
designating the nominal best practice.
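
The sequential screening logic summarized above can be sketched as a short check over the three analyses of one moderator. The class, function, and parameter names are hypothetical, the example results are invented, and the default variance criterion of .01 is an assumption that fits single-degree-of-freedom effects only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class AnalysisResult:
    p_value: float
    variance_explained: float
    equivalence_set: FrozenSet[str]


def is_robust_best_practice(results: Sequence[AnalysisResult],
                            alpha: float = 0.05,
                            min_variance: float = 0.01) -> bool:
    """Apply the three screening questions to every analysis of one moderator."""
    if not results:
        return False
    if any(r.p_value >= alpha for r in results):
        return False  # not statistically significant in every analysis
    if any(r.variance_explained < min_variance for r in results):
        return False  # too little variance explained in some analysis
    first = results[0].equivalence_set
    # A best practice needs the same single-option set in every analysis.
    return len(first) == 1 and all(r.equivalence_set == first for r in results)


# Hypothetical moderator: significant and large enough, but the sets disagree.
inconsistent = [
    AnalysisResult(0.000, 0.14, frozenset({"85%", "75%", "80%"})),
    AnalysisResult(0.001, 0.33, frozenset({"80%", "60%"})),
    AnalysisResult(0.000, 0.18, frozenset({"<60%", "85%", "75%"})),
]
# Hypothetical moderator that passes all three screens.
consistent = [AnalysisResult(0.000, 0.11, frozenset({"Option A"}))] * 3

screen_inconsistent = is_robust_best_practice(inconsistent)
screen_consistent = is_robust_best_practice(consistent)
```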
Silver Spring, MD 20910-7500    Jacksonville, FL 32213-0140

3. DATES COVERED (from - to): Apr 09 - Sep 09
5f. Work Unit Number: 60704
8. PERFORMING ORGANIZATION REPORT NUMBER: Report No. 10-21
10. Sponsor/Monitor's Acronym(s): NMRC/NMSC
